edu.georgetown.gucs.experiment
Class DuplicateFileFinder

java.lang.Object
  extended by edu.georgetown.gucs.experiment.DuplicateFileFinder

public class DuplicateFileFinder
extends java.lang.Object

Implements iMatch to find a baseline of similar files in a dataset and inserts the matches into a database. Files that contain zero tokens after processing are ignored.

See: Abdur Chowdhury, Ophir Frieder, David Grossman, and Mary Catherine McCabe. 2002. Collection statistics for fast duplicate document detection. ACM Trans. Inf. Syst. 20, 2 (April 2002), 171-191. http://doi.acm.org/10.1145/506309.506311
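
The fingerprinting idea described in the cited paper can be sketched as follows. This is a minimal illustration, not this class's actual code: the class name IMatchSketch, the method fingerprint, the plain Map standing in for the Dictionary, the inclusive treatment of the min/max bounds, and the choice of SHA-1 with a newline separator are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeSet;

// Hypothetical sketch of an I-Match style fingerprint; not part of this package.
class IMatchSketch {

    /**
     * Returns a hex SHA-1 digest over the sorted, unique tokens whose
     * normalized IDF lies in [min, max], or an empty Optional when no
     * token is retained (such files would be ignored).
     */
    static Optional<String> fingerprint(Collection<String> tokens,
                                        Map<String, Double> normalizedIdf,
                                        double min, double max) {
        // TreeSet removes repeats and fixes the order, so token order in the
        // file does not change the fingerprint.
        TreeSet<String> kept = new TreeSet<>();
        for (String token : tokens) {
            Double idf = normalizedIdf.get(token);
            // Assumption: tokens missing from the dictionary are discarded.
            if (idf != null && idf >= min && idf <= max) {
                kept.add(token);
            }
        }
        if (kept.isEmpty()) {
            return Optional.empty();
        }
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            for (String token : kept) {
                sha1.update(token.getBytes(StandardCharsets.UTF_8));
                sha1.update((byte) '\n'); // separator so "ab","c" differs from "a","bc"
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : sha1.digest()) {
                hex.append(String.format("%02x", b));
            }
            return Optional.of(hex.toString());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }

    public static void main(String[] args) {
        Map<String, Double> idf = Map.of("the", 0.05, "duplicate", 0.6, "detection", 0.7);
        Optional<String> a = fingerprint(List.of("the", "duplicate", "detection"), idf, 0.1, 0.9);
        Optional<String> b = fingerprint(List.of("detection", "the", "the", "duplicate"), idf, 0.1, 0.9);
        System.out.println(a.equals(b)); // prints "true": both keep {detection, duplicate}
    }
}
```

Two files that retain the same token set receive the same fingerprint and are reported as matches, which is why the min and max IDF bounds control how aggressive the matching is.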

Author:
Clay Shields

Constructor Summary
DuplicateFileFinder(Dictionary dictionary, double min, double max, java.lang.String directory, java.lang.String dataset)
          Constructor that sets all necessary information and finds the duplicates
 
Method Summary
 void addHash(java.io.File file, java.lang.String hash)
          Inserts the file and its hash into a hashMap of duplicates
 int countDuplicates()
           
 void findDuplicates(java.lang.String directory)
           
 double getMaxIDF()
          Provides the maximum normalized IDF to keep in the dictionary
 double getMinIDF()
          Provides the minimum normalized IDF to keep in the dictionary
 void insertDuplicatesInDB()
           
static void main(java.lang.String[] args)
          Finds similar files in the dataset using iMatch and inserts the matches into the database to use as a baseline.
 void recursiveFindDuplicates(java.lang.String directory)
           
 void setMaxIDF(double maxIDF)
          Sets the maximum normalized IDF to keep in the dictionary
 void setMinIDF(double minIDF)
          Sets the minimum normalized IDF to keep in the dictionary
 void showDuplicates()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DuplicateFileFinder

public DuplicateFileFinder(Dictionary dictionary,
                           double min,
                           double max,
                           java.lang.String directory,
                           java.lang.String dataset)
Constructor that sets all necessary information and finds the duplicates

Parameters:
dictionary - the dictionary of terms to use
min - the double minimum normalized IDF to keep; any token with a lower IDF will be discarded
max - the double maximum normalized IDF to keep; any token with a higher IDF will be discarded
directory - the path to the directory containing files to be compared
dataset - the name of the database to insert matches into
Method Detail

addHash

public void addHash(java.io.File file,
                    java.lang.String hash)
Inserts the file and its hash into a hashMap of duplicates

Parameters:
file - the file to insert
hash - the string hash to insert
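
One way to picture the hash map of duplicates is a map from each hash to the list of files that produced it. The sketch below is an assumption about the data structure, not the class's real internals: DuplicateIndexSketch, filesByHash, and duplicateGroups are hypothetical names introduced for illustration.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of grouping files by hash; not part of this package.
class DuplicateIndexSketch {
    // Assumption: one entry per distinct hash, holding every file with that hash.
    private final Map<String, List<File>> filesByHash = new HashMap<>();

    /** Records the file under its hash, creating the group on first use. */
    void addHash(File file, String hash) {
        filesByHash.computeIfAbsent(hash, k -> new ArrayList<>()).add(file);
    }

    /** Returns only the groups that contain more than one file. */
    List<List<File>> duplicateGroups() {
        List<List<File>> groups = new ArrayList<>();
        for (List<File> files : filesByHash.values()) {
            if (files.size() > 1) {
                groups.add(files);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        DuplicateIndexSketch index = new DuplicateIndexSketch();
        index.addHash(new File("a.txt"), "h1");
        index.addHash(new File("b.txt"), "h1");
        index.addHash(new File("c.txt"), "h2");
        System.out.println(index.duplicateGroups().size()); // prints "1"
    }
}
```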

countDuplicates

public int countDuplicates()
Returns:
the int count of duplicates

findDuplicates

public void findDuplicates(java.lang.String directory)
Parameters:
directory - the path to the directory containing files to be compared

getMaxIDF

public double getMaxIDF()
Provides the maximum normalized IDF to keep in the dictionary

Returns:
the double maximum normalized IDF to keep

getMinIDF

public double getMinIDF()
Provides the minimum normalized IDF to keep in the dictionary

Returns:
the double minimum normalized IDF to keep

insertDuplicatesInDB

public void insertDuplicatesInDB()

recursiveFindDuplicates

public void recursiveFindDuplicates(java.lang.String directory)
Parameters:
directory - the path to the directory containing files to be compared

setMaxIDF

public void setMaxIDF(double maxIDF)
Sets the maximum normalized IDF to keep in the dictionary

Parameters:
maxIDF - the double maximum normalized IDF to keep

setMinIDF

public void setMinIDF(double minIDF)
Sets the minimum normalized IDF to keep in the dictionary

Parameters:
minIDF - the double minimum normalized IDF to keep

showDuplicates

public void showDuplicates()

main

public static void main(java.lang.String[] args)
Finds similar files in the dataset using iMatch and inserts the matches into the database to use as a baseline. Any files that contain zero tokens after processing are ignored, though an error is generated.

Parameters:
args - array of string command line arguments
args[0] the filename containing the dictionary to load
args[1] the name of the database to insert matches into
args[2] the directory path containing the files to be compared
args[3] the minimum normalized inverse document frequency (IDF) to keep a token
args[4] the maximum normalized inverse document frequency (IDF) to keep a token
args[5] the option to print ("show") matches to the screen or insert ("insert") matches into the database
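
The six positional arguments could be validated along the following lines before any work starts. This is a hedged sketch only: FinderArgsSketch and its field names are hypothetical, and the checks shown (exact argument count, min not greater than max, mode limited to "show" or "insert") are assumptions about sensible validation rather than documented behavior of main.

```java
// Hypothetical sketch of parsing the command line described above.
class FinderArgsSketch {
    final String dictionaryFile; // args[0]
    final String dataset;        // args[1]
    final String directory;      // args[2]
    final double minIdf;         // args[3]
    final double maxIdf;         // args[4]
    final boolean insert;        // args[5]: true for "insert", false for "show"

    private FinderArgsSketch(String dictionaryFile, String dataset, String directory,
                             double minIdf, double maxIdf, boolean insert) {
        this.dictionaryFile = dictionaryFile;
        this.dataset = dataset;
        this.directory = directory;
        this.minIdf = minIdf;
        this.maxIdf = maxIdf;
        this.insert = insert;
    }

    /** Parses the six arguments; throws IllegalArgumentException on bad input. */
    static FinderArgsSketch parse(String[] args) {
        if (args.length != 6) {
            throw new IllegalArgumentException(
                "usage: <dictionary> <database> <directory> <minIDF> <maxIDF> <show|insert>");
        }
        // Double.parseDouble throws NumberFormatException, a subclass of
        // IllegalArgumentException, when the text is not a number.
        double min = Double.parseDouble(args[3]);
        double max = Double.parseDouble(args[4]);
        if (min > max) {
            throw new IllegalArgumentException("minIDF must not exceed maxIDF");
        }
        boolean insert;
        if ("insert".equals(args[5])) {
            insert = true;
        } else if ("show".equals(args[5])) {
            insert = false;
        } else {
            throw new IllegalArgumentException("mode must be \"show\" or \"insert\"");
        }
        return new FinderArgsSketch(args[0], args[1], args[2], min, max, insert);
    }

    public static void main(String[] args) {
        // Placeholder values chosen for the example only.
        FinderArgsSketch parsed = parse(new String[] {
            "dictionary.txt", "baseline", "/data/files", "0.1", "0.9", "show"});
        System.out.println(parsed.directory + " " + parsed.minIdf + " " + parsed.insert);
    }
}
```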