Fall 2005

Clay Shields



Project 2 - Index.dat Investigator

Assigned: Oct. 3
Design due: Oct. 17th
Program source code due: Oct. 24th

Index.dat Investigator


Very often, computer forensic investigations are intended to answer some question about what a particular user has been doing on the system, and the forensic investigator gathers information to support or refute some hypothesis about the user's behavior. Forensic investigators can examine many files to find evidence. One common technique is to examine the user's internet history to determine which web sites were visited and when. For a longer description of what can be found, take a look at this article.

For this project, we are going to write our own internet history parsing tool. The files we will be reading will consist of lines in the form:

<type> <time> <date> <domain> <file> <cache>

where:

<type> will always be the string "URL"

<time> is the local time on the user's computer

<date> is the date the web page was visited

<domain> is the web site visited, for example:
www.cs.georgetown.edu

<file> is the file retrieved from that web site, such as:
~clay/071/

<cache> is a string indicating to the web browser where on the user's computer a copy of the web page is saved

It is important to note that the computer adds to the bottom of the file as the user visits web pages, so the times and dates will always appear in chronological order as you read down the file, with the oldest visit at the top and the most recent at the bottom. Also, since the computer may not be used every day, the dates may not be continuous.
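As a rough illustration of the line format described above, here is a minimal C++ sketch that splits one line into its six fields. It assumes the fields are separated by whitespace and contain no embedded spaces, which the format description suggests but does not state outright. The struct IndexEntry and the function parseLine are hypothetical names chosen for this example, and the time and date are kept as plain strings because their exact layout is not specified here.

```cpp
#include <sstream>
#include <string>

// One record from the index file. The field names follow the format
// description; the struct itself is a hypothetical helper.
struct IndexEntry {
    std::string type;    // always the string "URL"
    std::string time;    // local time on the user's computer
    std::string date;    // date the web page was visited
    std::string domain;  // web site visited, e.g. www.cs.georgetown.edu
    std::string file;    // file retrieved from that site, e.g. ~clay/071/
    std::string cache;   // where the browser saved its copy of the page
};

// Splits one line into its six whitespace-separated fields.
// Returns false if fewer than six fields are present or if the
// first field is not "URL".
bool parseLine(const std::string& line, IndexEntry& entry) {
    std::istringstream in(line);
    if (!(in >> entry.type >> entry.time >> entry.date
             >> entry.domain >> entry.file >> entry.cache)) {
        return false;
    }
    return entry.type == "URL";
}
```

A caller would read each line with std::getline and pass it to parseLine, skipping any line for which it returns false.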

Your assignment

You are to write a program that will ask for the name of an index file to read and for a particular web domain that you are interested in. The program will calculate the total number of different days in the index file, the total number of visits to that web domain, and the percentage of days in the file on which the specified domain was visited. This will help investigators understand the user's behavior: whether the user was a frequent visitor to a particular site, and how many files they looked at there. Below is the output of a program that does this:
gusun% ./index
This program computes how often a particular URL is visited
on a daily basis. The inputs are the index.dat file to be
tested and the URL of interest. The output is the total
number of days covered in the file, the number of visits
to the specified URL, and the percentage of days the
specified URL was visited.

Please enter the filename to be investigated: small.dat

Please enter the URL to be searched for: www.cnn.com

Total days covered: 9
Total visits to www.cnn.com: 4
Percent of days that www.cnn.com was visited: 33.33%

You can run the version of the program that does this by logging onto gusun and typing ~clay/index. There are also some sample data files that you can use by copying them over to your directory. To do so, type:

cp ~clay/*.dat ~/

A short file named small.dat is available for you to look at; a longer one will be made available after the design documents are in.

UPDATE

I have added another file you can test your program with. You can use it by copying it over to your directory. To do so, type:

cp ~clay/long.dat ~/

and it will appear in your directory as long.dat. Or you can click here for it.

Part 1 - Design Document

For the first part you are to submit a design document showing the algorithm you plan to implement. DO NOT SUBMIT A FLOWCHART. I won't grade it. Instead, write it out neatly using a language similar to that from Homework 1, with the following terms:
  1. input
  2. output
  3. calculate
  4. if condition, then statement
  5. if condition, then statement; otherwise, statement
  6. while condition, do statement
  7. start
  8. stop
If you need to group multiple statements together, say in an if or while statement, use the following structure, including indentation:
    begin
       statement
       ...
       statement
    end
For calculate, you may only use the expressions we have covered in class.

A copy of your algorithm is due in class. Be sure to keep a copy for yourself!

Part 2 - Program Source Code

Important: Your output and input should be very similar to that shown in the example program. Please ask for the input in exactly the same order shown and only request the same items shown - do not ask for any other input. This will assist in grading your program.

Include the following header in your source code.

//
// Project 2
// Name: <your name>
// E-mail: <your e-mail address>
// COSC 071
//
// In accordance with the class policies and Georgetown's Honor Code,
// I certify that I have neither given nor received any assistance
// on this project with the exceptions of the lecture notes and those
// items noted below.
//
// Description: <Describe your program>
//

You will submit your source code using the submit program. This is the .cc file. Do not submit the compiled version! I don't speak binary very well.

To submit your program, make sure there is a copy of the source code on your account on gusun. You may name your program what you like - let's assume that it is called index.cc. To submit your program electronically, use the submit program like we did in Homework 2 and Project 1, but with the command:

submit -a p2 -f index.cc

I will not be enabling the electronic submission until after the design documents are in, so don't try to submit too early.

Bonus challenge section

You don't have to do this, but if you want a challenge, you can try to modify your program to count all visits to a particular domain. For example, a visitor might go to any of the following domains:

explore.georgetown.edu
www.clusters.arc.georgetown.edu
www.cs.georgetown.edu
www.georgetown.edu
www1.georgetown.edu
www13.georgetown.edu

As our program is currently specified, we would have to type each of these individually into our programs to find the visits to each.

It would be useful to be able to find all visits to any georgetown.edu site at once. Try to modify your program so that if the user enters georgetown.edu as the domain, it finds any site that ends in georgetown.edu, regardless of what it starts with. If you do this, be sure to note that you did so in your header comments.
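One possible approach to the "ends in" test is sketched below in C++. The function name matchesDomain is hypothetical. Note one design choice that goes beyond what the assignment asks: the sketch also requires the match to begin at the start of the name or just after a dot, so that a name such as notgeorgetown.edu is not counted. Drop that last check if you want a plain "ends with" comparison.

```cpp
#include <string>

// Returns true if `domain` is exactly `suffix` or ends with "." + suffix.
// Hypothetical helper for the bonus section.
bool matchesDomain(const std::string& domain, const std::string& suffix) {
    if (suffix.empty() || domain.size() < suffix.size()) {
        return false;
    }
    // Position where the suffix would have to begin inside `domain`.
    std::string::size_type start = domain.size() - suffix.size();
    if (domain.compare(start, suffix.size(), suffix) != 0) {
        return false;  // the last characters do not match the suffix
    }
    // Extra check (an assumption, not required by the assignment):
    // the match must begin at a label boundary.
    return start == 0 || domain[start - 1] == '.';
}
```

With this helper, the comparison of each entry's domain against the user's input becomes a call such as matchesDomain(domain, "georgetown.edu") instead of a test for exact equality.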