Fall 2005

Clay Shields


Project 3 - Forensic Hash Analysis

Assigned: October 24th, 2005
Program source code due: November 9th, 2005

Forensic Hash Analysis

Updated - Oct 30, 9:30 PM
Computer geeks are lazy, but in a good way - they generally try to avoid unnecessary work. In computer forensics, they do this by using an automated process to find files that they already know about. One mechanism commonly used to do this is called hash analysis.

Simply put, a hash is a function that takes some data in the form of bits and returns a fixed-length string that is dependent on the data. Just to confuse things a little, the output of a hash function is also commonly called a hash. The cool thing about hash functions is that it is incredibly rare for two different bunches of bits to have the same hash. It is also insanely difficult to find data that will match a given hash output. There are some ways to find two different pieces of data that do produce the same hash output, though.
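To make "fixed-length output that depends on the data" a little more concrete, here is a minimal sketch of a toy hash. It uses the 32-bit FNV-1a function only because it fits in a few lines; it is far too weak for forensic work and is not the function used to produce the hashes in this project. The names toy_hash and to_hex are made up for this illustration.

```cpp
#include <cstdint>
#include <string>

using namespace std;

// Toy 32-bit FNV-1a hash. Illustration only: NOT suitable for forensics.
uint32_t toy_hash(const string &data) {
    uint32_t h = 2166136261u;                  // FNV offset basis
    for (string::size_type i = 0; i < data.size(); i++) {
        h ^= static_cast<unsigned char>(data[i]);  // mix in one byte
        h *= 16777619u;                        // multiply by the FNV prime
    }
    return h;
}

// Format a 32-bit value as exactly 8 lowercase hex digits, so every
// input, long or short, ends up with an output of the same length.
string to_hex(uint32_t value) {
    const char digits[] = "0123456789abcdef";
    string out(8, '0');
    for (int i = 7; i >= 0; i--) {
        out[i] = digits[value & 0xF];
        value >>= 4;
    }
    return out;
}
```

Calling to_hex(toy_hash("a")) and to_hex(toy_hash("b")) gives two different 8-character strings, and so does any much longer input.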

For a hash analysis, the examiner will develop a library of interesting hashes, or download databases of known hashes, such as the one created by NIST. They will then compute the hash of some interesting or suspect files from the case they are working on, and compare those with the known hashes to identify ones that match.

Your job for this project will be to write a program to help perform hash analyses. Because we have some specific stuff to learn, namely using functions and vectors, I am going to make you do a few specific things in your program, as described below. For this project, the files we are working with will have a specific format. The hash files will start with three lines of comments describing what the hashes are; each of these lines starts with a # sign. All following lines of the file will have a hash, one to each line. An example is shown below.

#This file contains hashes of hacking tools
#Last updated 10/21/2005
#Clay Shields
The other type of file we will be working with contains a list of file names and hashes. These are the files we want to compare to our hash set. They also start with 3 lines of comments. A small example of this is:
# These are hashes of the
# ~clay/classes/f05/071/projects/p3 directory
# 10/21/2005 Clay Shields
a.out 6cf502e4a3b2a92334b43c7ebbd5adec
data_file 1938010c6305308c0ca8d8b0d8dc4969
hashes 8eb379c256416aa5c12a72ba39162101
hashes.cc 256ae342b7a3c47e44f42b86c06ff39c
known_hashes 0972956ce73314bec610b69efc9870f0
p3.html 28ae7f13bb9c0148dc746be63bfc9b23
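Because each data line holds a file name, some whitespace, and then a hash, the stream extraction operator can split a line into its two fields. The following is a rough sketch under a few assumptions: file names contain no spaces, the three comment lines are always present, and the helper name read_case_stream is made up here. It reads from any input stream that is already open rather than asking for a file name, which makes it easy to try out on a string.

```cpp
#include <istream>
#include <sstream>   // handy for trying this out with an istringstream
#include <string>
#include <vector>

using namespace std;

// Hypothetical helper: reads "name hash" pairs from an open stream.
// Assumes exactly three leading comment lines and no spaces in names.
void read_case_stream(istream &in, vector<string> &names,
                      vector<string> &hashes) {
    string line;
    for (int i = 0; i < 3; i++) {
        getline(in, line);          // read and discard one comment line
    }
    string name, hash;
    while (in >> name >> hash) {    // stops at end of input or a bad line
        names.push_back(name);
        hashes.push_back(hash);
    }
}
```

After the call, names[i] and hashes[i] describe the same file, so the two vectors always have the same length.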

Program Requirements

Since we are learning about functions, you will be required to use functions in your program. Since each function is its own little algorithm, it is best not to try and develop them all at once. Instead, you should develop and test each separately, and then combine them later. These instructions show you what you need to do, and provide some suggestions for this stepwise development.
Step 1
First, we need some functions to help us load the hash vectors and make sure they were loaded correctly. The prototypes for these functions are:
// Print out the contents of a vector of strings
// with each entry on its own line
void print_hash_vector(vector <string>);

// Ask for a file name, and load the hashes,
// and place them in a vector
vector <string> load_hashes();

Once you have written those, you can test them with this main function:

int main (){

    vector<string> test;

    test = load_hashes();
    print_hash_vector (test);
    return 0;
}
This should print out the list of hashes without the comments at the top.
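If it helps to see the skip-the-comments idea in isolation, here is one possible shape for it. This is only a sketch: the helpers read_hash_stream and write_hash_vector are hypothetical stand-ins that work on streams that are already open, so they do not ask for a file name the way load_hashes must. The sketch also assumes the three comment lines are always there, and it passes the vector by const reference to avoid a copy, which differs from the prototype above.

```cpp
#include <iostream>
#include <sstream>   // handy for trying this out with string streams
#include <string>
#include <vector>

using namespace std;

// Hypothetical helper: skips three comment lines, then keeps every
// remaining non-blank line as one hash.
vector<string> read_hash_stream(istream &in) {
    vector<string> hashes;
    string line;
    for (int i = 0; i < 3; i++) {
        getline(in, line);          // read and discard one comment line
    }
    while (getline(in, line)) {
        if (!line.empty()) {        // ignore blank lines, e.g. at the end
            hashes.push_back(line);
        }
    }
    return hashes;
}

// Hypothetical helper: writes each entry on its own line.
void write_hash_vector(ostream &out, const vector<string> &hashes) {
    for (vector<string>::size_type i = 0; i < hashes.size(); i++) {
        out << hashes[i] << endl;
    }
}
```

Passing cin or an ifstream as the first argument works the same way as passing a string stream.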
Step 2
Now we need to write some functions to help us read the file that has the filenames and hashes. We will call this the case file. The prototypes for these functions are:
// Prints the vector of files and the vector
// of the hashes of files side by side
void print_case_vectors(vector<string> , vector<string> );

// Asks for the case file name to read
// Reads the case file and stores all
// the file names in one vector and their
// hashes in a second vector
// Discards the comment lines
void load_case_file( vector<string> &, vector<string> &);

Once you have written those, you should make sure that they both work by using this main function:

int main (){

    vector<string> files, hashes;

    load_case_file(files, hashes);
    print_case_vectors (files, hashes);
    return 0;
}
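The & in the load_case_file prototype matters: it is what lets the function fill in the caller's vectors instead of private copies. The sketch below shows the difference with two tiny made-up functions, and then one possible way to print two vectors side by side. The name write_case_vectors and the column width of 20 are assumptions for this illustration, not part of the required output format.

```cpp
#include <iomanip>
#include <iostream>
#include <sstream>   // handy for trying this out with an ostringstream
#include <string>
#include <vector>

using namespace std;

// Receives a copy: the caller's vector is left unchanged.
void append_by_value(vector<string> v) {
    v.push_back("x");
}

// Receives a reference: the caller's vector grows by one entry.
void append_by_reference(vector<string> &v) {
    v.push_back("x");
}

// Hypothetical helper: one row per file, with the name padded on the
// right to 20 characters and the hash immediately after it.
void write_case_vectors(ostream &out, const vector<string> &names,
                        const vector<string> &hashes) {
    for (vector<string>::size_type i = 0;
         i < names.size() && i < hashes.size(); i++) {
        out << left << setw(20) << names[i] << hashes[i] << endl;
    }
}
```

Note that setw only applies to the next item written, so the hash itself is not padded.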
Step 3
Now that we can load the hashes and the case file, we need to write a function that will do the comparisons. What we want is a function that will go through all the hashes loaded from the case, find any that match a hash loaded from the hashes file, and print out the matches to the screen. We will also count how many matches there are and return that value. Below is a function prototype for that function. Notice that it takes three vectors and returns an integer.

// This function takes three vectors:
// the vector of names from the case file
// the vector of hashes from the case file
// the vector of hashes from the hashes file.
// It outputs the matches, and also returns
// an integer with a count of the files that matched.
int compare_hashes_with_case (vector <string>,
vector <string>, vector <string>);
And, once again, here is a main function you can use to test it:
int main (){

    vector<string> case_names, case_hashes, known_hashes;
    int matches = 0;

    known_hashes = load_hashes();
    load_case_file(case_names, case_hashes);

    matches = compare_hashes_with_case( case_names, case_hashes, known_hashes);
    cout << "Number of matches: " << matches << endl;

    return 0;
}
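The comparison itself boils down to "for each case hash, search the known hashes". Below is a minimal sketch of that idea using std::find from the standard library; a pair of nested loops would do the same job. The function name count_matches, the extra stream parameter, and the "name hash" output layout are all assumptions made for this sketch, so check the sample solution for the output that is actually expected.

```cpp
#include <algorithm>
#include <iostream>
#include <sstream>   // handy for trying this out with an ostringstream
#include <string>
#include <vector>

using namespace std;

// Hypothetical variant of compare_hashes_with_case that writes matches
// to the given stream and returns how many case files matched.
int count_matches(ostream &out, const vector<string> &case_names,
                  const vector<string> &case_hashes,
                  const vector<string> &known_hashes) {
    int matches = 0;
    for (vector<string>::size_type i = 0;
         i < case_names.size() && i < case_hashes.size(); i++) {
        // find returns end() when the value is not in the range
        if (find(known_hashes.begin(), known_hashes.end(), case_hashes[i])
                != known_hashes.end()) {
            out << case_names[i] << " " << case_hashes[i] << endl;
            matches++;
        }
    }
    return matches;
}
```

This does a linear search for every case file, which is fine for small hash sets like the samples here.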
Step 4
Now we just have to write a main function that will let us do all this as a menu. You should read characters, sort them using a switch statement, and then call the right functions as needed. You can see my solution online for how to do this, by following the directions below.
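For anyone who has not written a menu loop with a switch before, here is the general pattern as a sketch. Everything specific in it is invented for illustration: the letters h, c, m and q, the messages, and the function name run_menu are not taken from the sample solution, and your program should follow the sample solution's prompts instead. The messages mark where the real function calls would go.

```cpp
#include <iostream>
#include <sstream>   // handy for trying this out with string streams
#include <string>

using namespace std;

// Hypothetical menu loop: reads one character per command until 'q' or
// the end of input. Returns how many valid commands were handled.
int run_menu(istream &in, ostream &out) {
    int handled = 0;
    bool done = false;
    char choice;
    while (!done && in >> choice) {
        switch (choice) {
            case 'h':
                out << "load hash set" << endl;    // load_hashes() goes here
                handled++;
                break;
            case 'c':
                out << "load case file" << endl;   // load_case_file() goes here
                handled++;
                break;
            case 'm':
                out << "compare" << endl;          // comparison goes here
                handled++;
                break;
            case 'q':
                done = true;                       // leave the loop
                break;
            default:
                out << "unknown option: " << choice << endl;
                break;
        }
    }
    return handled;
}
```

Each case ends with break; without it, execution would fall through into the next case.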
As with the last projects, you can copy my solution to your gusun account and play with it as needed. To do this, type:

cp ~clay/hashes ./

You can also copy over the small sample case file or the sample hash set by typing the following two commands:

cp ~clay/case ./
cp ~clay/hash_set ./

What to turn in

Important: Your output and input should be very similar to that shown in the example program. Please ask for the input in exactly the same order shown and only request the same items shown - do not ask for any other input. This will assist in grading your program.

Include the following header in your source code.

	    // Project 3
	    // Name: <your name>
	    // E-mail: <your e-mail address>
	    // COSC 071
	    // In accordance with the class policies and 
	    // Georgetown's Honor Code, I certify that I 
	    // have neither given nor received any assistance
	    // on this project with the exceptions of the
	    //  lecture notes and those
	    // items noted below.
	    // Description: <Describe your program>

You will submit your source code using the submit program. This is the .cc file. Do not submit the compiled version! I don't speak binary very well.

To submit your program, make sure there is a copy of the source code on your account on gusun. You may name your program what you like - let's assume that it is called hashes.cc. To submit your program electronically, use the submit program like we did in Homework 2 and Projects 1 and 2, but with the command:

submit -a p3 -f hashes.cc


Notice that I am not requiring a design document. Woohoo for you! But now you have the chance to really dig a hole for yourself by waiting until too late to get started. So, the start early rule still applies. Start early!

Second, even though I don't require a design document, you still need to think about the design before you start coding. Coding as you design is the path to unhappiness, sleeplessness, frustration, and a 50% chance of scattered program bugs and falling grades. Think before you write! When you ask me or the TAs for help, the first thing we are going to ask is: what is your algorithm? Have one.