java.lang.Object
  edu.georgetown.gucs.experiment.DuplicateFileFinder
public class DuplicateFileFinder
Implements iMatch to find a baseline of similar files in a dataset and inserts the matches into a database. Files that contain zero tokens after processing are ignored.
See: Abdur Chowdhury, Ophir Frieder, David Grossman, and Mary Catherine McCabe. 2002. Collection statistics for fast duplicate document detection. ACM Trans. Inf. Syst. 20, 2 (April 2002), 171-191. http://doi.acm.org/10.1145/506309.506311
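The filtering and hashing idea described above can be illustrated with a minimal sketch. It is not this class's implementation: the class name `IMatchSketch`, the `signature` method, the in-memory IDF map, the inclusive bounds, the choice to drop tokens missing from the map, and the use of SHA-1 over sorted unique tokens are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeSet;

// Hypothetical illustration; none of these names come from DuplicateFileFinder.
public class IMatchSketch {

    /**
     * Returns a hex SHA-1 signature of the tokens whose normalized IDF lies in
     * [minIdf, maxIdf], or an empty Optional when no token survives the filter.
     */
    public static Optional<String> signature(List<String> tokens, Map<String, Double> idf,
                                             double minIdf, double maxIdf) {
        // TreeSet keeps unique tokens in sorted order, so word order and
        // repetition in the file do not change the signature.
        TreeSet<String> kept = new TreeSet<>();
        for (String token : tokens) {
            Double value = idf.get(token); // assumption: unknown tokens are dropped
            if (value != null && value >= minIdf && value <= maxIdf) {
                kept.add(token);
            }
        }
        if (kept.isEmpty()) {
            return Optional.empty(); // zero tokens after processing: ignore the file
        }
        MessageDigest sha1;
        try {
            sha1 = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        for (String token : kept) {
            sha1.update(token.getBytes(StandardCharsets.UTF_8));
            sha1.update((byte) 0); // separator so "ab","c" differs from "a","bc"
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : sha1.digest()) {
            hex.append(String.format("%02x", b));
        }
        return Optional.of(hex.toString());
    }

    public static void main(String[] args) {
        Map<String, Double> idf =
                Map.of("the", 0.01, "duplicate", 0.55, "detection", 0.60, "xqzv", 0.99);
        Optional<String> a = signature(List.of("the", "duplicate", "detection"), idf, 0.1, 0.9);
        Optional<String> b = signature(List.of("detection", "the", "xqzv", "duplicate"), idf, 0.1, 0.9);
        System.out.println(a.equals(b)); // same surviving tokens, same signature
        System.out.println(signature(List.of("the", "xqzv"), idf, 0.1, 0.9).isPresent());
    }
}
```

Two files whose surviving token sets are identical receive the same signature, which is what lets a single hash lookup stand in for pairwise comparison.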
Constructor Summary | |
---|---|
DuplicateFileFinder(Dictionary dictionary,
double min,
double max,
java.lang.String directory,
java.lang.String dataset)
Constructor that sets all necessary information and finds the duplicates |
Method Summary | |
---|---|
void |
addHash(java.io.File file,
java.lang.String hash)
Inserts the file and its hash into a hashMap of duplicates |
int |
countDuplicates()
|
void |
findDuplicates(java.lang.String directory)
|
double |
getMaxIDF()
Provides the maximum normalized IDF to keep in the dictionary |
double |
getMinIDF()
Provides the minimum normalized IDF to keep in the dictionary |
void |
insertDuplicatesInDB()
|
static void |
main(java.lang.String[] args)
Finds similar files in the dataset using iMatch and inserts the matches into the database to use as a baseline. |
void |
recursiveFindDuplicates(java.lang.String directory)
|
void |
setMaxIDF(double maxIDF)
Sets the maximum normalized IDF to keep in the dictionary |
void |
setMinIDF(double minIDF)
Sets the minimum normalized IDF to keep in the dictionary |
void |
showDuplicates()
|
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public DuplicateFileFinder(Dictionary dictionary, double min, double max, java.lang.String directory, java.lang.String dataset)
Parameters:
dictionary - the dictionary of terms to use
min - the double minimum normalized IDF to keep; any token with a lower IDF will be discarded
max - the double maximum normalized IDF to keep; any token with a higher IDF will be discarded
directory - the path to the directory containing files to be compared
dataset - the name of the database to insert matches into
Method Detail |
---|
public void addHash(java.io.File file, java.lang.String hash)
Parameters:
file - the file to insert
hash - the string hash to insert
public int countDuplicates()
Returns:
int
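As a rough illustration of how a hash map of duplicates might back `addHash` and `countDuplicates`, the sketch below groups files by hash. The class name `DuplicateGroupsSketch`, the `Map<String, List<File>>` layout, and the counting rule (every file that shares its hash with at least one other file) are assumptions; the documentation above does not specify them.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; the real class's internals are not documented.
public class DuplicateGroupsSketch {

    // Maps each hash to every file that produced it, in insertion order.
    private final Map<String, List<File>> filesByHash = new LinkedHashMap<>();

    /** Records the file under its hash, creating the group on first use. */
    public void addHash(File file, String hash) {
        filesByHash.computeIfAbsent(hash, key -> new ArrayList<>()).add(file);
    }

    /** Counts files that share a hash with at least one other file (assumed rule). */
    public int countDuplicates() {
        int count = 0;
        for (List<File> group : filesByHash.values()) {
            if (group.size() > 1) {
                count += group.size();
            }
        }
        return count;
    }

    public static void main(String[] args) {
        DuplicateGroupsSketch groups = new DuplicateGroupsSketch();
        groups.addHash(new File("a.txt"), "h1");
        groups.addHash(new File("b.txt"), "h1");
        groups.addHash(new File("c.txt"), "h2");
        System.out.println(groups.countDuplicates()); // prints 2
    }
}
```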
public void findDuplicates(java.lang.String directory)
Parameters:
directory - the path to the directory containing files to be compared
public double getMaxIDF()
Returns:
the double maximum normalized IDF to keep
public double getMinIDF()
Returns:
the double minimum normalized IDF to keep
public void insertDuplicatesInDB()
public void recursiveFindDuplicates(java.lang.String directory)
Parameters:
directory - the path to the directory containing files to be compared
public void setMaxIDF(double maxIDF)
Parameters:
maxIDF - the double maximum normalized IDF to keep
public void setMinIDF(double minIDF)
Parameters:
minIDF - the double minimum normalized IDF to keep
public void showDuplicates()
public static void main(java.lang.String[] args)
Parameters:
args - array of string command line arguments
args[0] - the filename containing the dictionary to load
args[1] - the name of the database to insert matches into
args[2] - the directory path containing the files to be compared
args[3] - the minimum normalized inverse document frequency (IDF) for a token to be kept
args[4] - the maximum normalized inverse document frequency (IDF) for a token to be kept
args[5] - the option to print ("show") matches to the screen or insert ("insert") matches into the database
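The six-argument layout documented for `main` can be sketched as a small parser. Everything here is hypothetical: the `FinderArgsSketch` and `Options` names, the requirement of exactly six arguments, and the check that the minimum does not exceed the maximum are assumptions, and the real `main` may validate its input differently.

```java
// Hypothetical illustration of reading the documented argument order.
public class FinderArgsSketch {

    /** Hypothetical holder for the six documented command-line arguments. */
    public static final class Options {
        public final String dictionaryFile; // args[0]
        public final String dataset;        // args[1]
        public final String directory;      // args[2]
        public final double minIdf;         // args[3]
        public final double maxIdf;         // args[4]
        public final boolean insert;        // args[5]: true for "insert", false for "show"

        Options(String dictionaryFile, String dataset, String directory,
                double minIdf, double maxIdf, boolean insert) {
            this.dictionaryFile = dictionaryFile;
            this.dataset = dataset;
            this.directory = directory;
            this.minIdf = minIdf;
            this.maxIdf = maxIdf;
            this.insert = insert;
        }
    }

    /** Parses the arguments in the documented order; the validation rules are assumed. */
    public static Options parse(String[] args) {
        if (args.length != 6) {
            throw new IllegalArgumentException("expected 6 arguments, got " + args.length);
        }
        double minIdf = Double.parseDouble(args[3]);
        double maxIdf = Double.parseDouble(args[4]);
        if (minIdf > maxIdf) {
            throw new IllegalArgumentException("minimum IDF exceeds maximum IDF");
        }
        boolean insert;
        if (args[5].equals("insert")) {
            insert = true;
        } else if (args[5].equals("show")) {
            insert = false;
        } else {
            throw new IllegalArgumentException("option must be \"show\" or \"insert\"");
        }
        return new Options(args[0], args[1], args[2], minIdf, maxIdf, insert);
    }

    public static void main(String[] args) {
        // Placeholder values chosen only for this example.
        Options o = parse(new String[] {
            "dictionary.txt", "baseline", "/data/files", "0.1", "0.9", "show"});
        System.out.println(o.directory + " " + o.minIdf + " " + o.maxIdf + " " + o.insert);
    }
}
```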