A B C D E F G H I L M N O P R S T U V X 

A

add(int) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Adds an element to this set using OR operator; set grows as necessary to accommodate
addDictionaryToTrial(int, int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Adds this dictionary to this trial in this database
addDocument(Iterator<String>) - Method in class edu.georgetown.gucs.dictionary.Dictionary
Adds a document to this dictionary.
addFile(File) - Method in class edu.georgetown.gucs.experiment.DBInterface
Provides the ID for the file if it already exists in the file table of this database; otherwise creates a new ID and adds it
addFileMatch(File, File, double, double) - Method in class edu.georgetown.gucs.experiment.DBInterface
Checks if files match and if so adds files and their normalized min and max IDF to the file_match table
addManglerToTrial(int, int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Adds the mangler ID to the trial in this database
addNullManglerToTrial(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Adds a "null" mangler to position 1 of the trial_mangler table
addToFingerprintMatch(int, int, int, boolean, boolean, int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Adds fingerprint match information if the match is not already recorded in the fingerprint_match table of this database
addTokenizer(String) - Method in class edu.georgetown.gucs.experiment.DBInterface
Adds a tokenizer to the tokenizer table in this database
addTokenizer(String) - Method in class edu.georgetown.gucs.tokenizers.TokenizerList
Adds a tokenizer to the end of this list of tokenizers.
addTokenizerNames(List<String>) - Method in class edu.georgetown.gucs.tokenizers.TokenizerList
Adds tokenizers to the end of this list of tokenizers
addTokenizers(String) - Method in class edu.georgetown.gucs.dictionary.Dictionary
Reads an XML file that specifies which tokenizers to use for this dictionary
addTokenizers(String) - Method in class edu.georgetown.gucs.tokenizers.TokenizeFile
Adds tokenizers from an XML file to the list of tokenizers to use on this document
addTrialResults(int, int, int, int, int, int, double, double, double) - Method in class edu.georgetown.gucs.experiment.DBInterface
Adds the results from a trial to this database
addTrialToExperiment(int, int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Adds the trial ID to the experiment in this database
and(AntlrBitSet) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Performs AND operation
andInPlace(AntlrBitSet) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Performs AND operation
AntlrBitSet - Class in edu.georgetown.gucs.utility
Creates a bit set to replace java.util.BitSet.
AntlrBitSet() - Constructor for class edu.georgetown.gucs.utility.AntlrBitSet
Constructs a bitset of size one word (64 bits)
AntlrBitSet(long[]) - Constructor for class edu.georgetown.gucs.utility.AntlrBitSet
Constructs a clone of the static array of longs
AntlrBitSet(int) - Constructor for class edu.georgetown.gucs.utility.AntlrBitSet
Constructs a bitset of given size
AntlrBitSet(byte[]) - Constructor for class edu.georgetown.gucs.utility.AntlrBitSet
Constructor that converts and sets this bitset
ArabicFileTokenizer - Class in edu.georgetown.gucs.tokenizers
Splits contents of an Arabic text file into tokens using Apache Lucene's ArabicAnalyzer.
ArabicFileTokenizer() - Constructor for class edu.georgetown.gucs.tokenizers.ArabicFileTokenizer
 
asHex(byte[]) - Static method in class edu.georgetown.gucs.utility.Converter
Converts bytes to a hex string
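
The asHex(byte[]) entry above describes a byte-to-hex conversion. The following is a minimal sketch of one way such a conversion could work, not the source of Converter.asHex. The class name HexSketch is hypothetical, and lowercase output with two digits per byte is an assumption.

```java
// Hypothetical stand-in for illustration; not the source of Converter.asHex.
class HexSketch {
    private static final char[] DIGITS = "0123456789abcdef".toCharArray();

    // Converts each byte to two lowercase hex digits, high nibble first.
    // Lowercase output is an assumption; the index does not specify the case.
    static String asHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(DIGITS[(b >> 4) & 0x0F]); // upper four bits
            sb.append(DIGITS[b & 0x0F]);        // lower four bits
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(asHex(new byte[] {0x00, 0x7f, (byte) 0xff}));
    }
}
```

Masking with 0x0F after the shift keeps negative byte values from producing an out-of-range index.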

B

base64Fingerprint - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
 
BITS - Static variable in class edu.georgetown.gucs.utility.AntlrBitSet
 
bits - Variable in class edu.georgetown.gucs.utility.AntlrBitSet
The actual data bits
bitSetToByteArray(BitSet) - Static method in class edu.georgetown.gucs.utility.Converter
Provides a byte array of at least length 1, with the most significant bit guaranteed not to be a 1 (since BitSet does not support sign extension); byte-ordering of the result is big-endian (most significant bit is in element 0); bit at index 0 of the bit set is assumed to be the least significant bit
BitVectorFingerprinter - Class in edu.georgetown.gucs.fingerprinter
Creates a fingerprint that consists of a bit vector representing the presence or absence of terms in a particular dictionary
BitVectorFingerprinter() - Constructor for class edu.georgetown.gucs.fingerprinter.BitVectorFingerprinter
Constructor that generates the fingerprint name, version, unique identifier (GUID), system identifier, and creating program.
BitVectorFingerprinter(String) - Constructor for class edu.georgetown.gucs.fingerprinter.BitVectorFingerprinter
Constructor that loads a dictionary and its tokenizers.
BoundedExecutor - Class in edu.georgetown.gucs.utility
Limits the number of threads to run at one time when running more tasks than there are cores
BoundedExecutor(Executor, int) - Constructor for class edu.georgetown.gucs.utility.BoundedExecutor
Constructor that sets the executor and semaphore upper bound
buildObject(String) - Static method in class edu.georgetown.gucs.utility.ObjectBuilder
Builds an object with given name and parameters
bulkFileAdd(File) - Method in class edu.georgetown.gucs.experiment.DBInterface
Adds a file to the file table of this database
bulkFileMatchAdd(File, File, double, double) - Method in class edu.georgetown.gucs.experiment.DBInterface
Adds file matches and their normalized min and max IDF to the file_match table
byteArrayToBitSet(byte[]) - Static method in class edu.georgetown.gucs.utility.Converter
Provides a bitset containing the values in a byte array; byte-ordering of the byte array must be big-endian (most significant bit in element 0)
byteArrayToIndices(byte[]) - Static method in class edu.georgetown.gucs.utility.Converter
Converts a byte array to a set of indices
byteRun - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
 
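
The bitSetToByteArray(BitSet) and byteArrayToBitSet(byte[]) entries above describe a big-endian layout in which bit 0 of the bit set is the least significant bit and the top bit of element 0 is never set. The following is a sketch of a conversion that satisfies those stated properties, not the Converter source. The class name BitSetBytesSketch is hypothetical, and the length rule (length() / 8 + 1) is an assumption chosen to guarantee the clear top bit.

```java
import java.util.BitSet;

// Hypothetical stand-in for illustration; not the source of Converter.
class BitSetBytesSketch {
    // Big-endian output: element 0 holds the most significant bits.
    // Allocating length() / 8 + 1 bytes (an assumption) yields at least one
    // byte and leaves the top bit of element 0 clear.
    static byte[] bitSetToByteArray(BitSet bits) {
        int length = bits.length() / 8 + 1;
        byte[] out = new byte[length];
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            // Bit i lives in the byte counted from the end of the array.
            out[length - 1 - i / 8] |= (byte) (1 << (i % 8));
        }
        return out;
    }

    // Inverse mapping: reads a big-endian byte array back into a BitSet.
    static BitSet byteArrayToBitSet(byte[] bytes) {
        BitSet bits = new BitSet(bytes.length * 8);
        for (int i = 0; i < bytes.length * 8; i++) {
            if ((bytes[bytes.length - 1 - i / 8] & (1 << (i % 8))) != 0) {
                bits.set(i);
            }
        }
        return bits;
    }
}
```

With bits 0 and 9 set, this sketch produces the two bytes 0x02, 0x01, and converting them back gives an equal BitSet.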

C

calculateTrialResults() - Method in class edu.georgetown.gucs.experiment.Trial
Computes the precision, recall, and f-score for each mangler
changeRandomSeed(int, long) - Method in class edu.georgetown.gucs.experiment.DBInterface
Updates this database with the random seed to use for this trial
checkIndexing(String, int, int) - Method in class edu.georgetown.gucs.tokenizers.FileTokenizer
Prints to the screen a portion of the given text document
checkIndexing(String, int, int) - Method in class edu.georgetown.gucs.tokenizers.GzippedFileTokenizer
Prints to the screen a portion of the given gzipped text document
checkIndexing(String, int, int) - Method in class edu.georgetown.gucs.tokenizers.OutsideInFileTokenizer
Prints to the screen a portion of the given document
ChineseFileTokenizer - Class in edu.georgetown.gucs.tokenizers
Splits contents of a Chinese or Chinese-English text file into tokens using Apache Lucene's ChineseAnalyzer or SmartChineseAnalyzer.
ChineseFileTokenizer() - Constructor for class edu.georgetown.gucs.tokenizers.ChineseFileTokenizer
Constructor that sets the token creation mode to split by smart tokenization using probabilistic word segmentation
ChineseFileTokenizer(String) - Constructor for class edu.georgetown.gucs.tokenizers.ChineseFileTokenizer
Constructor that sets the token creation mode.
clear() - Method in class edu.georgetown.gucs.tokenizers.TokenizerList
Clears this list of tokenizers
clear() - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Clears all elements in this bitset
clear(int) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Clears the given element in this bitset
clearExperimentResults(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Removes the experiment results from experiment_results table
clone(Dictionary) - Method in class edu.georgetown.gucs.dictionary.Dictionary
Creates a deep copy of the given dictionary object and stores it in this dictionary
clone() - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Creates and returns a copy of this AntlrBitSet
close() - Method in class edu.georgetown.gucs.experiment.DBInterface
Closes this database
CommandLine - Class in sdtext
Provides a command line interface to run the different programs within SDTEXT (Similarity Digest Text)
CommandLine() - Constructor for class sdtext.CommandLine
 
CompareDirectory - Class in edu.georgetown.gucs.matcher
Outputs a list of files and their scores, comparing a given fingerprint to a directory of files.
CompareDirectory(String, String, String) - Constructor for class edu.georgetown.gucs.matcher.CompareDirectory
Constructor that sets the matcher, fingerprint file and directory to use for this comparison
CompareDirectory(String, String, String, int) - Constructor for class edu.georgetown.gucs.matcher.CompareDirectory
Constructor that sets the matcher, fingerprint file, directory and minimum score to use for this comparison
CompareDirectory(String, String, String, int, String) - Constructor for class edu.georgetown.gucs.matcher.CompareDirectory
Constructor that sets the matcher, fingerprint file, directory, minimum score and dictionary to use for this comparison
computeFingerprint(String) - Method in class edu.georgetown.gucs.fingerprinter.BitVectorFingerprinter
Computes a byte array fingerprint indicating the presence or absence of each token in this dictionary
computeFingerprint(String) - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
Computes the fingerprint of this document as a byte array; indicates the presence or absence of each token in this dictionary
computeFingerprints() - Method in class edu.georgetown.gucs.matcher.CompareDirectory
Computes a fingerprint for each file in this directory and compares it to this fingerprint using this matcher; if the matcher returns boolean values, true values are given a score of 99 and false values are given a score of 0
computeFingerprintXML(String) - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
Computes the fingerprint of a document as a Base64 encoded string; indicates the presence or absence of each token in this dictionary
containsToken(String) - Method in class edu.georgetown.gucs.dictionary.Dictionary
Determines if this dictionary contains a particular token
Converter - Class in edu.georgetown.gucs.utility
Converts between bits, bytes, and hex
Converter() - Constructor for class edu.georgetown.gucs.utility.Converter
 
CosineSimilarityFingerprintMatcher - Class in edu.georgetown.gucs.matcher
Provides a score based on a cosine similarity match to be used with a BitVectorFingerprinter.
CosineSimilarityFingerprintMatcher() - Constructor for class edu.georgetown.gucs.matcher.CosineSimilarityFingerprintMatcher
Constructor that sets the minimum score to use for matching two fingerprints to zero
CosineSimilarityFingerprintMatcher(int) - Constructor for class edu.georgetown.gucs.matcher.CosineSimilarityFingerprintMatcher
Constructor that sets the minimum score to use for matching two fingerprints
CosineSimilarityFingerprintMatcher(String) - Constructor for class edu.georgetown.gucs.matcher.CosineSimilarityFingerprintMatcher
Constructor that sets the minimum score to use for matching two fingerprints.
countResults(int, int, int, boolean, boolean) - Method in class edu.georgetown.gucs.experiment.DBInterface
Counts the number of entries with the given manglers and matching results in the fingerprint_match table for a trial
createFromFileLister(FileLister) - Method in class edu.georgetown.gucs.dictionary.Dictionary
Creates this dictionary using an optional number or percent of the files in the given fileLister; if no count or percent is specified then uses entire fileList to create dictionary
createFromFileLister(FileLister, double) - Method in class edu.georgetown.gucs.dictionary.Dictionary
Creates this dictionary using an optional number or percent of the files in the given fileLister
createNewListFromCount(int) - Method in class edu.georgetown.gucs.utility.FileLister
Creates a new list of randomly chosen files of the given size from this fileLister
createNewListFromPercent(double) - Method in class edu.georgetown.gucs.utility.FileLister
Creates a new list of randomly chosen files that is a percent of the files from this fileLister
creatingProgram - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
 
creator - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
 
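
The CosineSimilarityFingerprintMatcher entries above describe a cosine similarity score over bit-vector fingerprints. For binary vectors the cosine reduces to the number of shared set bits divided by the square root of the product of the two set-bit counts. The following sketch shows that calculation; it is not the matcher's source. The class name CosineSketch is hypothetical, and returning a double in the range 0 to 1 is an assumption (the index does not state how getScore scales its result).

```java
// Hypothetical stand-in for illustration; not the source of
// CosineSimilarityFingerprintMatcher.getScore.
class CosineSketch {
    // Cosine similarity of two equal-length bit vectors stored as bytes.
    // Returns a value in [0, 1]; this scale is an assumption.
    static double cosine(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("fingerprints must have equal length");
        }
        int shared = 0; // bits set in both vectors (the dot product)
        int onesA = 0;  // bits set in a (squared norm of a)
        int onesB = 0;  // bits set in b (squared norm of b)
        for (int i = 0; i < a.length; i++) {
            shared += Integer.bitCount(a[i] & b[i] & 0xFF);
            onesA += Integer.bitCount(a[i] & 0xFF);
            onesB += Integer.bitCount(b[i] & 0xFF);
        }
        if (onesA == 0 || onesB == 0) {
            return 0.0; // an empty vector has no direction; treat as no match
        }
        return shared / Math.sqrt((double) onesA * onesB);
    }
}
```

For example, a vector with four set bits compared against a vector containing two of those bits scores 2 / sqrt(4 * 2), roughly 0.707.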

D

DBInterface - Class in edu.georgetown.gucs.experiment
Creates a database of the experiment results
DBInterface(String) - Constructor for class edu.georgetown.gucs.experiment.DBInterface
Constructor that connects to a database
deepCopy(Object) - Static method in class edu.georgetown.gucs.utility.ObjectCloner
Provides a deep copy of an object
degree() - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Returns the degree of this bitset
degreeOfDifference(byte[]) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Returns the number of bits that differ between this AntlrBitSet and the provided byte[] bitset
degreeOfSimilarity(byte[]) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Returns the number of bits that are the same between this AntlrBitSet and the provided byte[] bitset
Dictionary - Class in edu.georgetown.gucs.dictionary
Creates a list of unique tokens extracted from a collection of documents that can be trimmed by removing tokens based on various different attributes; used for creating fingerprints of documents that are based on words that appear in a document collection
Dictionary() - Constructor for class edu.georgetown.gucs.dictionary.Dictionary
Constructor that sets default values for this dictionary.
Dictionary(String) - Constructor for class edu.georgetown.gucs.dictionary.Dictionary
Constructor that loads this dictionary from a file
dictionary - Variable in class edu.georgetown.gucs.dictionary.Dictionary
 
dictionary - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
 
DICTIONARY_SIZE - Static variable in class edu.georgetown.gucs.experiment.Global
 
DictionaryEntry - Class in edu.georgetown.gucs.dictionary
Statistics kept on a per-token basis within a Dictionary
dictionaryTokenIterator() - Method in class edu.georgetown.gucs.dictionary.Dictionary
Allows for operations on the tokens in this dictionary
DictionaryWorker - Class in edu.georgetown.gucs.dictionary
Thread for processing a file to add to a Dictionary
DictionaryWorker(File, List<String>, Dictionary) - Constructor for class edu.georgetown.gucs.dictionary.DictionaryWorker
Constructor that specifies a file to tokenize and add to an existing dictionary
disableMangler() - Method in class edu.georgetown.gucs.tokenizers.FileManglerTokenizer
Disables the mangler in this tokenizer
disableMangler() - Method in class edu.georgetown.gucs.tokenizers.TokenizerList
Disables the mangler in the FileManglerTokenizer object in this list
diskImage - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
 
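
The degreeOfDifference(byte[]) and degreeOfSimilarity(byte[]) entries above count differing and matching bits. The following sketch shows one reading of those descriptions, using XOR and a population count over two byte arrays; it is not the AntlrBitSet source. The class name BitDifferenceSketch is hypothetical, and treating "the same" as "equal at the same position, whether set or clear" is an assumption.

```java
// Hypothetical stand-in for illustration; not the source of AntlrBitSet.
class BitDifferenceSketch {
    // Counts bit positions where the two arrays differ (Hamming distance).
    static int degreeOfDifference(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        int differing = 0;
        for (int i = 0; i < a.length; i++) {
            // XOR leaves a 1 wherever the bits disagree; mask to one byte.
            differing += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
        }
        return differing;
    }

    // Counts bit positions where the two arrays agree. Assumes agreement
    // includes positions where both bits are clear.
    static int degreeOfSimilarity(byte[] a, byte[] b) {
        return a.length * 8 - degreeOfDifference(a, b);
    }
}
```

For the arrays {0x0F, 0x00} and {0x03, 0x80}, three bits differ and thirteen agree.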

E

edu.georgetown.gucs.dictionary - package edu.georgetown.gucs.dictionary
 
edu.georgetown.gucs.experiment - package edu.georgetown.gucs.experiment
 
edu.georgetown.gucs.fingerprinter - package edu.georgetown.gucs.fingerprinter
 
edu.georgetown.gucs.matcher - package edu.georgetown.gucs.matcher
 
edu.georgetown.gucs.tokenizers - package edu.georgetown.gucs.tokenizers
 
edu.georgetown.gucs.utility - package edu.georgetown.gucs.utility
 
enableMangler(String) - Method in class edu.georgetown.gucs.tokenizers.FileManglerTokenizer
Enables the mangler with the given settings in this tokenizer
enableMangler(String, List<String>) - Method in class edu.georgetown.gucs.tokenizers.FileManglerTokenizer
Enables the mangler with the given settings in this tokenizer
enableMangler(String) - Method in class edu.georgetown.gucs.tokenizers.TokenizerList
Enables the mangler with the given settings in the FileManglerTokenizer object in this list
enableMangler(String, List<String>) - Method in class edu.georgetown.gucs.tokenizers.TokenizerList
Enables the mangler with the given settings in the FileManglerTokenizer object in this list
endTraining() - Method in class edu.georgetown.gucs.dictionary.Dictionary
Ends training, which prohibits adding new tokens to this dictionary and allows this dictionary to be written to a file.
EnronEmailTokenizer - Class in edu.georgetown.gucs.tokenizers
Eliminates tokens related to attachments in the Enron data set
EnronEmailTokenizer() - Constructor for class edu.georgetown.gucs.tokenizers.EnronEmailTokenizer
 
EnronStripMailHeaderTokenizer - Class in edu.georgetown.gucs.tokenizers
Eliminates tokens from headers in the Enron data set
EnronStripMailHeaderTokenizer() - Constructor for class edu.georgetown.gucs.tokenizers.EnronStripMailHeaderTokenizer
Constructor to remove tokens from headers in the Enron data set
equals(Object) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Returns true if the obj is a bit set that contains exactly the same elements as this bit set, otherwise false.
ExactFingerprintMatcher - Class in edu.georgetown.gucs.matcher
Determines if two fingerprints' documents are the same; uses a BitVectorFingerprinter
ExactFingerprintMatcher() - Constructor for class edu.georgetown.gucs.matcher.ExactFingerprintMatcher
Constructor that sets the minimum score to use for matching two fingerprints to zero
ExceptionHandler - Class in edu.georgetown.gucs.utility
Prints exception messages and stack traces before calling System.exit(-1)
ExceptionHandler() - Constructor for class edu.georgetown.gucs.utility.ExceptionHandler
 
executeFileMatch() - Method in class edu.georgetown.gucs.experiment.DBInterface
Adds file matches to database
Experiment - Class in edu.georgetown.gucs.experiment
Runs many Trials to determine the best settings to use for Dictionary and Fingerprinter creation.
Experiment(String) - Constructor for class edu.georgetown.gucs.experiment.Experiment
Constructor that loads properties from XML configuration file
extractConfigFile(int, String) - Method in class edu.georgetown.gucs.experiment.DBInterface
Extracts a configuration file from an experiment in this database
extractDictionary(int, String) - Method in class edu.georgetown.gucs.experiment.DBInterface
Extracts a dictionary from a trial in this database
ExtractDictionary - Class in edu.georgetown.gucs.fingerprinter
Produces a Dictionary in XML format from a given Fingerprinter digest containing a dictionary.
ExtractDictionary() - Constructor for class edu.georgetown.gucs.fingerprinter.ExtractDictionary
 
extractDictionaryXML(String, String) - Method in class edu.georgetown.gucs.fingerprinter.ExtractDictionary
Produces an XML dictionary object from a fingerprint digest.

F

FALSE_NEGATIVE - Static variable in class edu.georgetown.gucs.experiment.Global
 
FALSE_POSITIVE - Static variable in class edu.georgetown.gucs.experiment.Global
 
fileExists(File) - Method in class edu.georgetown.gucs.experiment.DBInterface
Checks whether the file exists in the database
FileLister - Class in edu.georgetown.gucs.utility
Creates a list of files in a directory and allows iteration over all and parts of this list
FileLister() - Constructor for class edu.georgetown.gucs.utility.FileLister
Default constructor
FileLister(String) - Constructor for class edu.georgetown.gucs.utility.FileLister
Constructor that loads all the files from the given directory
FileManglerTokenizer - Class in edu.georgetown.gucs.tokenizers
Manipulates files as they are read.
FileManglerTokenizer() - Constructor for class edu.georgetown.gucs.tokenizers.FileManglerTokenizer
Constructor that initializes the random number generator and clears the mangler settings
FileManglerTokenizer(String) - Constructor for class edu.georgetown.gucs.tokenizers.FileManglerTokenizer
Constructor that sets the mangler settings
FileTokenizer - Class in edu.georgetown.gucs.tokenizers
Splits contents of a text file into tokens based on whitespace or by line.
FileTokenizer() - Constructor for class edu.georgetown.gucs.tokenizers.FileTokenizer
Constructor that sets the token creation mode to split based on whitespace
FileTokenizer(String) - Constructor for class edu.georgetown.gucs.tokenizers.FileTokenizer
Constructor that sets the token creation mode.
finalize() - Method in class edu.georgetown.gucs.experiment.DBInterface
Closes this database
Fingerprinter - Class in edu.georgetown.gucs.fingerprinter
This is the base class for all fingerprinters.
Fingerprinter() - Constructor for class edu.georgetown.gucs.fingerprinter.Fingerprinter
Constructor that generates the fingerprint name, version, unique identifier (GUID), system identifier, and creating program.
Fingerprinter(Dictionary) - Constructor for class edu.georgetown.gucs.fingerprinter.Fingerprinter
Constructor that loads a dictionary and its tokenizers.
Fingerprinter(String) - Constructor for class edu.georgetown.gucs.fingerprinter.Fingerprinter
Constructor that loads a dictionary and its tokenizers.
FingerprintMatcher - Class in edu.georgetown.gucs.matcher
This is the base class for all fingerprint matchers.
FingerprintMatcher() - Constructor for class edu.georgetown.gucs.matcher.FingerprintMatcher
Constructor that sets the minimum score to use for matching two fingerprints to zero
FingerprintMatcher(int) - Constructor for class edu.georgetown.gucs.matcher.FingerprintMatcher
Constructor that sets the minimum score to use for matching two fingerprints
fingerprintName - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
 
finishBulkFileAdd() - Method in class edu.georgetown.gucs.experiment.DBInterface
Rebuilds indexes for file_full_path table
finishBulkFileMatchAdd() - Method in class edu.georgetown.gucs.experiment.DBInterface
Rebuilds indexes for file_match table
finishBulkFingerprintInsertion() - Method in class edu.georgetown.gucs.experiment.DBInterface
Rebuilds indexes for fingerprint_file_id, fingerprint_trial_id, and fingerprint_mangler_id tables
finishBulkFingerprintMatchInsertion() - Method in class edu.georgetown.gucs.experiment.DBInterface
Rebuilds indexes on the fingerprint_match tables
finishRun(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Finishes an experiment run
FSCORE - Static variable in class edu.georgetown.gucs.experiment.Global
 
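
The Global constants FALSE_NEGATIVE, FALSE_POSITIVE, and FSCORE listed above relate to the precision, recall, and f-score that Trial.calculateTrialResults() is described as computing. The following sketch shows the standard formulas from true-positive, false-positive, and false-negative counts; it is not the Trial source. The class name ScoreSketch is hypothetical, and using the balanced F1 form of the f-score is an assumption.

```java
// Hypothetical stand-in for illustration; not the source of Trial.
class ScoreSketch {
    // Fraction of reported matches that are correct.
    static double precision(int truePositives, int falsePositives) {
        int reported = truePositives + falsePositives;
        return reported == 0 ? 0.0 : (double) truePositives / reported;
    }

    // Fraction of actual matches that were reported.
    static double recall(int truePositives, int falseNegatives) {
        int actual = truePositives + falseNegatives;
        return actual == 0 ? 0.0 : (double) truePositives / actual;
    }

    // Harmonic mean of precision and recall (balanced F1; an assumption).
    static double fScore(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }
}
```

With 8 true positives, 2 false positives, and 8 false negatives, precision is 0.8, recall is 0.5, and the f-score is 0.8 / 1.3, roughly 0.615.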

G

generateCreatingProgram() - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
Determines the program that created this fingerprinter
generateXML(String, String, String) - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
Generates an XML file that contains this fingerprinter's digest
getA() - Method in class edu.georgetown.gucs.utility.Pair
Provides the first object in this pair
getAllFingerprintsFromTrial(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Provides all the fingerprints and fingerprint IDs for a trial
getAllMatchingFileIDs(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Provides all the file IDs of files that have matches
getB() - Method in class edu.georgetown.gucs.utility.Pair
Provides the second object in this pair
getBase64Fingerprint() - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
Gives this fingerprinter's fingerprint in Base64 encoding
getCreatingProgram() - Method in class edu.georgetown.gucs.dictionary.Dictionary
Provides the program that created this dictionary
getCreatingProgram() - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
Gives the program that created this fingerprinter
getCreation() - Method in class edu.georgetown.gucs.dictionary.Dictionary
Provides the original creation date of this dictionary
getDataset() - Method in class edu.georgetown.gucs.experiment.TrialParameters
Provides the dataset for this trial
getDictionary() - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
Gives this fingerprinter's dictionary
getDictionaryFilename() - Method in class edu.georgetown.gucs.dictionary.Dictionary
Provides the filename of this dictionary, if this dictionary was loaded from or saved to a file
getDictionaryFilename() - Method in class edu.georgetown.gucs.fingerprinter.ExtractDictionary
Provides the filename of this extracted dictionary
getDictionaryName() - Method in class edu.georgetown.gucs.dictionary.Dictionary
Provides the name of this dictionary
getDictionaryParameter() - Method in class edu.georgetown.gucs.experiment.TrialParameters
Provides the dictionary count or percent for this trial
getDictionarySize() - Method in class edu.georgetown.gucs.experiment.Trial
Provides the size of the dictionary used
getDirectory() - Method in class edu.georgetown.gucs.utility.FileLister
Provides the path to the directory used for the initial list of files
getDocumentCount() - Method in class edu.georgetown.gucs.dictionary.DictionaryEntry
Provides the number of documents this token appears in
getExperimentDatasetPath(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Provides the path of the dataset used in an experiment
getExperimentDescription(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Provides the description for an experiment
getExperimentDictionarySetting(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Provides the count or percent size of files used for the dictionary in an experiment
getExperimentFingerprinterName(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Provides the fingerprinter name used in an experiment
getExperimentLanguage(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Provides the language used in an experiment
getExperimentManglerSettings(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Provides the mangler names and settings used in an experiment
getExperimentMatcherName(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Provides the matcher name used in an experiment
getExperimentMatcherParameter(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Provides the minimum score used for the matcher in an experiment
getExperimentMaximumIDF(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Provides the maximum normalized IDF for an experiment's dictionary
getExperimentMinimumIDF(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Provides the minimum normalized IDF for an experiment's dictionary
getExperimentSampleSetting(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Provides the count or percent size of the sample of files used for an experiment
getExperimentTokenizers(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Provides the tokenizer names used in an experiment
getExperimentTrialCount(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Provides the number of trials in an experiment
getFileID(File) - Method in class edu.georgetown.gucs.experiment.DBInterface
Provides the ID of the file or 0 if it is not in this database
getFileTypes() - Method in class edu.georgetown.gucs.utility.FileLister
Provides the counts for each file extension that exists in this list of files
getFingerprint(String) - Method in class edu.georgetown.gucs.matcher.ScoreFingerprints
Extracts the Base64 encoded fingerprint string from the given XML fingerprint digest file
getFingerprinter() - Method in class edu.georgetown.gucs.experiment.TrialParameters
Provides the fingerprinter for this trial
getFingerprintName() - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
Gives this fingerprinter's name.
getFrequency(String) - Method in class edu.georgetown.gucs.dictionary.Dictionary
Provides the number of times the given token appeared when creating this dictionary
getFrequencyCount() - Method in class edu.georgetown.gucs.dictionary.DictionaryEntry
Provides the frequency count for this entry
getGUID() - Method in class edu.georgetown.gucs.dictionary.Dictionary
Provides the unique identifier for this dictionary
getIDF(String) - Method in class edu.georgetown.gucs.dictionary.Dictionary
Provides the inverse document frequency (IDF) of the given token in this dictionary
getIDF() - Method in class edu.georgetown.gucs.dictionary.DictionaryEntry
Provides the IDF of this token
getIDF(int) - Method in class edu.georgetown.gucs.dictionary.DictionaryEntry
Computes the IDF of this token using the totalDocuments
getIterator() - Method in class edu.georgetown.gucs.tokenizers.TokenizerList
Provides the string iterator over this document's tokens from the last tokenizer from this list
getLanguage() - Method in class edu.georgetown.gucs.dictionary.Dictionary
Provides the language of this dictionary
getManglerID(String) - Method in class edu.georgetown.gucs.experiment.DBInterface
Provides the ID for the existing mangler setting that matches or creates a new one
getManglerResults() - Method in class edu.georgetown.gucs.experiment.Trial
Provides the results for each mangler; sleeps until the trial is finished
getManglers() - Method in class edu.georgetown.gucs.experiment.TrialParameters
Provides the manglers for this trial
getManglersForTrial(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Provides all the mangler IDs for a trial
getMatcher() - Method in class edu.georgetown.gucs.experiment.TrialParameters
Provides the matcher for this trial
getMatcherName() - Method in class edu.georgetown.gucs.matcher.FingerprintMatcher
Provides the name of the matcher used to determine if the two fingerprints have matching documents
getMatcherScore() - Method in class edu.georgetown.gucs.experiment.TrialParameters
Provides the matcher minimum score for this trial
getMatchingFileCount() - Method in class edu.georgetown.gucs.experiment.DBInterface
Provides the number of file matches
getMaxIDF() - Method in class edu.georgetown.gucs.dictionary.Dictionary
Provides the largest IDF value in this dictionary
getMaxIDF() - Method in class edu.georgetown.gucs.experiment.TrialParameters
Provides the maximum IDF for this trial
getMean(List<Double>) - Static method in class edu.georgetown.gucs.experiment.Statistics
Provides the mean for the given values
getMinIDF() - Method in class edu.georgetown.gucs.experiment.TrialParameters
Provides the minimum IDF for this trial
getMinimumScore() - Method in class edu.georgetown.gucs.matcher.FingerprintMatcher
Provides the minimum score for these two fingerprints to be considered a match.
getMode() - Method in class edu.georgetown.gucs.tokenizers.FileTokenizer
Provides the token creation mode
getNames() - Method in class edu.georgetown.gucs.tokenizers.TokenizerList
Provides the ordered list of the tokenizer names in this list
getNewListIterator() - Method in class edu.georgetown.gucs.utility.FileLister
Provides an iterator over the updated list of files; need to call newCountIterator or newPercentIterator to update the list of files
getNormalizedIDF(String) - Method in class edu.georgetown.gucs.dictionary.Dictionary
Provides the normalized inverse document frequency (IDF) of the given token in this dictionary.
getNumberOfFiles() - Method in class edu.georgetown.gucs.utility.FileLister
Provides the size of the initial list of files
getOriginalFingerprintsFromTrial(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Provides the fingerprints and fingerprint IDs with null manglers for a trial
getPosition(String) - Method in class edu.georgetown.gucs.dictionary.Dictionary
Provides the position of the given token in this dictionary
getPosition() - Method in class edu.georgetown.gucs.dictionary.DictionaryEntry
Provides the position of the token in this dictionary.
getPositionsVector() - Method in class edu.georgetown.gucs.tokenizers.Tokenizer
Provides the list of integer pairs corresponding to the position of each token
getRandomToken() - Method in class edu.georgetown.gucs.dictionary.Dictionary
Gives a random token from this dictionary
getSampleParameter() - Method in class edu.georgetown.gucs.experiment.TrialParameters
Provides the sample count or percent for this trial
getScore(byte[], byte[]) - Method in class edu.georgetown.gucs.matcher.CosineSimilarityFingerprintMatcher
Determines the cosine similarity score for these two fingerprints
getScore(byte[], byte[]) - Method in class edu.georgetown.gucs.matcher.ExactFingerprintMatcher
Determines a similarity score for these two fingerprints
getScore(byte[], byte[]) - Method in class edu.georgetown.gucs.matcher.FingerprintMatcher
Determines a similarity score for these two fingerprints.
getScoreXML(String, String) - Method in class edu.georgetown.gucs.matcher.FingerprintMatcher
Determines a similarity score for these two fingerprints.
getSize() - Method in class edu.georgetown.gucs.dictionary.Dictionary
Provides the number of tokens in this dictionary
getSource() - Method in class edu.georgetown.gucs.dictionary.Dictionary
Provides the directory used for this dictionary
getStandardDeviation(List<Double>) - Static method in class edu.georgetown.gucs.experiment.Statistics
Provides the standard deviation for the given values
getStandardDeviation(List<Double>, double) - Static method in class edu.georgetown.gucs.experiment.Statistics
Provides the standard deviation for the given values
getSystemID() - Method in class edu.georgetown.gucs.dictionary.Dictionary
Provides the identifier for the system that this dictionary was created on
getToken() - Method in class edu.georgetown.gucs.dictionary.DictionaryEntry
Provides the name of the token
getTokenizerID(String) - Method in class edu.georgetown.gucs.experiment.DBInterface
Provides the ID of the tokenizer or 0 if it is not in the tokenizer table of this database
getTokenizerNames() - Method in class edu.georgetown.gucs.dictionary.Dictionary
Provides the names of the tokenizers used to create this dictionary
getTokenizers() - Method in class edu.georgetown.gucs.experiment.TrialParameters
Provides the tokenizers for this trial
getTokenizers() - Method in class edu.georgetown.gucs.tokenizers.TokenizeFile
Provides the list of tokenizers to use on this document
getTokenizers() - Method in class edu.georgetown.gucs.tokenizers.TokenizerList
Provides the ordered list of the tokenizers in this list
getTokens() - Method in class edu.georgetown.gucs.dictionary.Dictionary
Provides a vector of the tokens in this dictionary
getTokenVector() - Method in class edu.georgetown.gucs.tokenizers.Tokenizer
Provides the list of each token in order of its appearance
getTotalDocuments() - Method in class edu.georgetown.gucs.dictionary.Dictionary
Provides the number of documents processed for this dictionary
getTrialFileIDs(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Provides all the file IDs and whether they have matches in a trial
getTrialParameters() - Method in class edu.georgetown.gucs.experiment.Trial
Provides the parameters for this trial that are needed for results comparisons
getTValueNinetyFive(int) - Static method in class edu.georgetown.gucs.experiment.Statistics
Provides the t-value used for the 95% confidence interval
getVersion() - Method in class edu.georgetown.gucs.dictionary.Dictionary
Provides the version number of this dictionary.
Global - Class in edu.georgetown.gucs.experiment
Global information for an Experiment
Global() - Constructor for class edu.georgetown.gucs.experiment.Global
 
growToInclude(int) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Grows the set to a larger number of bits
GUID - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
 
GzippedFileTokenizer - Class in edu.georgetown.gucs.tokenizers
Splits contents of a gzipped text file into tokens based on whitespace or by line.
GzippedFileTokenizer() - Constructor for class edu.georgetown.gucs.tokenizers.GzippedFileTokenizer
Constructor that sets the token creation mode to split based on whitespace
GzippedFileTokenizer(String) - Constructor for class edu.georgetown.gucs.tokenizers.GzippedFileTokenizer
Constructor that sets the token creation mode.
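
The Statistics entries above (getMean, getStandardDeviation) only name the quantities they return. As a minimal sketch of what such helpers might compute, assuming the sample (n - 1) form of the standard deviation, which this index does not confirm, the class and method names below are hypothetical and are not the library's source:

```java
import java.util.List;

// Hypothetical sketch; NOT the actual edu.georgetown.gucs.experiment.Statistics code.
final class StatisticsSketch {
    private StatisticsSketch() {}

    /** Arithmetic mean of the values; NaN for an empty list. */
    static double mean(List<Double> values) {
        if (values.isEmpty()) {
            return Double.NaN;
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    /**
     * Standard deviation around a precomputed mean.
     * Assumption: sample form with an (n - 1) denominator.
     */
    static double standardDeviation(List<Double> values, double mean) {
        if (values.size() < 2) {
            return 0.0;
        }
        double sumOfSquares = 0.0;
        for (double v : values) {
            double d = v - mean;
            sumOfSquares += d * d;
        }
        return Math.sqrt(sumOfSquares / (values.size() - 1));
    }

    /** Convenience overload that computes the mean first. */
    static double standardDeviation(List<Double> values) {
        return standardDeviation(values, mean(values));
    }
}
```

The two-argument overload mirrors the getStandardDeviation(List<Double>, double) signature listed above, on the assumption that the extra double is a precomputed mean.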

H

handleException(Exception) - Static method in class edu.georgetown.gucs.utility.ExceptionHandler
Prints the message and stack trace for this Exception
handleException(InternalError) - Static method in class edu.georgetown.gucs.utility.ExceptionHandler
Prints the message and stack trace for this InternalError
handleSQLException(SQLException) - Static method in class edu.georgetown.gucs.utility.ExceptionHandler
Prints the message, SQLState, vendor-specific exception code, and stack trace for this SQLException
handleSQLException(SQLException, Connection, Statement) - Static method in class edu.georgetown.gucs.utility.ExceptionHandler
Prints the message, SQLState, vendor-specific exception code, and stack trace for this SQLException
hashCode() - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Returns a hash code value for this bit set.
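
The handleSQLException entries above describe printing the message, SQLState, vendor-specific exception code, and stack trace. A minimal sketch of that kind of report, using only the standard java.sql.SQLException accessors, could look like the following; the class name, method names, and output layout are assumptions and are not taken from ExceptionHandler:

```java
import java.sql.SQLException;

// Hypothetical sketch; NOT the actual edu.georgetown.gucs.utility.ExceptionHandler code.
final class SqlExceptionReportSketch {
    private SqlExceptionReportSketch() {}

    /** Builds a three-line description from the standard SQLException accessors. */
    static String describe(SQLException e) {
        String nl = System.lineSeparator();
        return "Message: " + e.getMessage() + nl
                + "SQLState: " + e.getSQLState() + nl
                + "Vendor code: " + e.getErrorCode();
    }

    /** Prints the description and the stack trace to standard error. */
    static void handle(SQLException e) {
        System.err.println(describe(e));
        e.printStackTrace();
    }
}
```

Keeping the formatting in describe separate from the printing in handle makes the report easy to check without capturing standard error.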

I

incrementDocumentCount() - Method in class edu.georgetown.gucs.dictionary.DictionaryEntry
Increases the number of documents this token has been seen in by 1
incrementDocumentCount(int) - Method in class edu.georgetown.gucs.dictionary.DictionaryEntry
Increases the number of documents this token has been seen in by the given amount
incrementFrequencyCount() - Method in class edu.georgetown.gucs.dictionary.DictionaryEntry
Increases the number of times this token has been seen by 1
incrementFrequencyCount(int) - Method in class edu.georgetown.gucs.dictionary.DictionaryEntry
Increases the number of times this token has been seen by the given amount
insertExperimentResults(int, int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Inserts experiment results into the database; computes the average, standard deviation, and error for precision, recall, F-score, and dictionary size
insertFingerprint(File, byte[], int, int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Inserts fingerprint information into the database and returns the corresponding fingerprint ID
iterator() - Method in class edu.georgetown.gucs.tokenizers.Tokenizer
Provides an iterator over the tokens in this tokenizer
iterator() - Method in class edu.georgetown.gucs.utility.FileLister
Provides an iterator over the list of files

L

lengthInLongWords() - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Returns the amount of space being used by the bits array (not the number of member bits on)
loadDictionary(String) - Method in class edu.georgetown.gucs.dictionary.Dictionary
Loads a previously serialized dictionary object into this dictionary
loadDictionaryXML(String) - Method in class edu.georgetown.gucs.dictionary.Dictionary
Loads a previously serialized XML dictionary object into this dictionary
loadDirectory(String) - Method in class edu.georgetown.gucs.utility.FileLister
Loads all the files from the given directory into the initial list of files
loadFingerprintXML(String) - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
Loads a previously generated fingerprinter digest from a fingerprint XML file.
loadFingerprintXML(String, String) - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
Loads a previously generated fingerprinter digest from fingerprinter and dictionary XML files.
loadStopWords(String) - Method in class edu.georgetown.gucs.tokenizers.StopWordRemoverTokenizer
Loads the words to eliminate when the tokenize method is called
loadXMLConfig(String) - Method in class edu.georgetown.gucs.experiment.Experiment
Loads properties from an XML configuration file
LOG_BITS - Static variable in class edu.georgetown.gucs.utility.AntlrBitSet
 

M

main(String[]) - Static method in class edu.georgetown.gucs.dictionary.Dictionary
Creates a dictionary from the specified directory using the provided tokenizers
main(String[]) - Static method in class edu.georgetown.gucs.dictionary.ShowDictionaryStatistics
Prints statistics for a dictionary, including name, language, number of documents, number of tokens, minimum and maximum IDF, list of tokenizers, and whether the dictionary has been trimmed
main(String[]) - Static method in class edu.georgetown.gucs.dictionary.ShowDictionaryTokens
Prints a dictionary's tokens and their frequencies, IDFs, and normalized IDFs
main(String[]) - Static method in class edu.georgetown.gucs.dictionary.TrimDictionary
Trims a dictionary by removing any token that is outside a range of normalized IDFs
main(String[]) - Static method in class edu.georgetown.gucs.experiment.Experiment
Runs experiments using multiple runs with different settings and generates results
main(String[]) - Static method in class edu.georgetown.gucs.fingerprinter.BitVectorFingerprinter
Creates a bit vector fingerprint representing the presence or absence of terms in a particular dictionary
main(String[]) - Static method in class edu.georgetown.gucs.fingerprinter.ExtractDictionary
Extracts a dictionary from a fingerprint digest and writes it to a new dictionary file
main(String[]) - Static method in class edu.georgetown.gucs.matcher.CompareDirectory
Outputs a list of files and their scores, comparing a given fingerprint to a directory of files.
main(String[]) - Static method in class edu.georgetown.gucs.matcher.CosineSimilarityFingerprintMatcher
 
main(String[]) - Static method in class edu.georgetown.gucs.matcher.ExactFingerprintMatcher
 
main(String[]) - Static method in class edu.georgetown.gucs.matcher.ScoreFingerprints
Compares two XML fingerprints and provides a score for their degree of similarity.
main(String[]) - Static method in class edu.georgetown.gucs.tokenizers.ArabicFileTokenizer
Tokenizes an Arabic text file and prints the resulting tokens to the screen
main(String[]) - Static method in class edu.georgetown.gucs.tokenizers.ChineseFileTokenizer
Tokenizes a Chinese text file both into single characters and using a probabilistic model, and prints both sets of resulting tokens to the screen
main(String[]) - Static method in class edu.georgetown.gucs.tokenizers.FileManglerTokenizer
 
main(String[]) - Static method in class edu.georgetown.gucs.tokenizers.FileTokenizer
Tokenizes a text file and prints the resulting tokens to the screen
main(String[]) - Static method in class edu.georgetown.gucs.tokenizers.GzippedFileTokenizer
Tokenizes a gzipped text file and prints the resulting tokens to the screen
main(String[]) - Static method in class edu.georgetown.gucs.tokenizers.OutsideInFileTokenizer
Tokenizes a file using OutsideIn and prints the resulting tokens to the screen
main(String[]) - Static method in class edu.georgetown.gucs.tokenizers.TokenizeFile
Tokenizes a text file and prints the resulting tokens to the screen
main(String[]) - Static method in class sdtext.CommandLine
Provides a command line interface to run the different SDTEXT programs
makeDictionary(String, String, String) - Method in class edu.georgetown.gucs.dictionary.Dictionary
Creates a dictionary from the given directory; reads an XML file that specifies which tokenizers to use and outputs the dictionary to a file
match(byte[], byte[]) - Method in class edu.georgetown.gucs.matcher.CosineSimilarityFingerprintMatcher
Determines that the two fingerprints are matching if their similarity score is at or above the minimum score for this fingerprinter
match(byte[], byte[]) - Method in class edu.georgetown.gucs.matcher.ExactFingerprintMatcher
Determines that the two fingerprints are matching if their byte arrays are equal
match(byte[], byte[]) - Method in class edu.georgetown.gucs.matcher.FingerprintMatcher
Determines if the two fingerprints' documents match.
match(String, String) - Method in class edu.georgetown.gucs.matcher.FingerprintMatcher
Determines if these two fingerprints' documents match.
matcherName - Variable in class edu.georgetown.gucs.matcher.FingerprintMatcher
 
matches(File) - Method in class edu.georgetown.gucs.experiment.DBInterface
Matches the given file against the files in this database
MaximumLengthTokenizer - Class in edu.georgetown.gucs.tokenizers
Eliminates tokens that are longer than a given length
MaximumLengthTokenizer(String) - Constructor for class edu.georgetown.gucs.tokenizers.MaximumLengthTokenizer
Constructor that sets the maximum token length to be considered
member(int) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
TODO
mergeDictionary(Map<String, DictionaryEntry>) - Method in class edu.georgetown.gucs.dictionary.Dictionary
Merges the given Map of tokens to DictionaryEntry objects into this dictionary
minimum_score - Variable in class edu.georgetown.gucs.matcher.FingerprintMatcher
 
MinimumLengthTokenizer - Class in edu.georgetown.gucs.tokenizers
Eliminates tokens that are shorter than a given length
MinimumLengthTokenizer(String) - Constructor for class edu.georgetown.gucs.tokenizers.MinimumLengthTokenizer
Constructor that sets the minimum token length to be considered
MOD_MASK - Static variable in class edu.georgetown.gucs.utility.AntlrBitSet
A precomputed mod mask.
mode - Variable in class edu.georgetown.gucs.tokenizers.FileTokenizer
The string specifying to split tokens based on whitespace ("tokens") or by line ("lines")
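
The match and getScore entries above say that CosineSimilarityFingerprintMatcher treats two fingerprints as matching when their cosine similarity score is at or above the minimum score. As a rough sketch of that rule, assuming each fingerprint is a bit vector packed into a byte[] and that the score is an integer on a 0 to 100 scale (both are assumptions, as are all names below), one could write:

```java
// Hypothetical sketch; NOT the actual CosineSimilarityFingerprintMatcher code.
final class CosineMatchSketch {
    private CosineMatchSketch() {}

    /**
     * Cosine similarity of two equal-length bit vectors, scaled to 0..100.
     * For bit vectors the dot product is the number of bits set in both,
     * and each norm is the square root of the number of bits set.
     */
    static int score(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("fingerprints must have equal length");
        }
        int both = 0;
        int onesA = 0;
        int onesB = 0;
        for (int i = 0; i < a.length; i++) {
            int x = a[i] & 0xFF; // treat each byte as unsigned
            int y = b[i] & 0xFF;
            both += Integer.bitCount(x & y);
            onesA += Integer.bitCount(x);
            onesB += Integer.bitCount(y);
        }
        if (onesA == 0 || onesB == 0) {
            return 0; // similarity is undefined for an empty vector; report 0
        }
        double cosine = both / Math.sqrt((double) onesA * onesB);
        return (int) Math.round(cosine * 100.0);
    }

    /** A match is a score at or above the minimum score. */
    static boolean match(byte[] a, byte[] b, int minimumScore) {
        return score(a, b) >= minimumScore;
    }
}
```

By contrast, the rule described for ExactFingerprintMatcher (byte arrays are equal) corresponds to a plain java.util.Arrays.equals(a, b) comparison.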

N

newDictionary(Dictionary) - Method in class edu.georgetown.gucs.experiment.DBInterface
Inserts the compressed dictionary into blob space and stores the size and OID for it in the database
newExperiment(int, String, String, String, String, String, String, int, double, double, double, double, String) - Method in class edu.georgetown.gucs.experiment.DBInterface
Creates a new experiment
newTrial(long) - Method in class edu.georgetown.gucs.experiment.DBInterface
Creates a new trial
nextPosition - Variable in class edu.georgetown.gucs.dictionary.Dictionary
 
NIBBLE - Static variable in class edu.georgetown.gucs.utility.AntlrBitSet
 
nil() - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Returns whether this bitset is empty
not() - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Provides the bitwise complement of this bitset
notInPlace() - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Performs a bitwise complement to invert the bits in this bitset
notInPlace(int) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Performs a bitwise complement to invert the bits from zero to maxBit
notInPlace(int, int) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Performs a bitwise complement to invert the bits from minBit to maxBit
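
The notInPlace(int, int) entry above describes inverting the bits from minBit to maxBit. A minimal sketch of that operation over a long[] word array follows; the constant values (6 and 63, for 64-bit words) are assumptions, since this index lists LOG_BITS and MOD_MASK without their values, and the sketch does not grow the array the way growToInclude suggests the real class does:

```java
// Hypothetical sketch; NOT the actual edu.georgetown.gucs.utility.AntlrBitSet code.
final class BitComplementSketch {
    private BitComplementSketch() {}

    static final int LOG_BITS = 6;  // assumed: log2 of 64 bits per long word
    static final int MOD_MASK = 63; // assumed: mask selecting the bit index within a word

    /**
     * Inverts every bit from minBit to maxBit (inclusive) in place.
     * The caller must supply an array large enough to hold maxBit.
     */
    static void notInPlace(long[] words, int minBit, int maxBit) {
        for (int bit = minBit; bit <= maxBit; bit++) {
            int wordIndex = bit >> LOG_BITS;      // which long holds this bit
            long mask = 1L << (bit & MOD_MASK);   // position of the bit inside that long
            words[wordIndex] ^= mask;             // XOR with 1 flips the bit
        }
    }
}
```

Applying the same range twice restores the original words, which is a quick sanity check for any complement routine.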

O

ObjectBuilder - Class in edu.georgetown.gucs.utility
Builds an Object of the given name
ObjectBuilder() - Constructor for class edu.georgetown.gucs.utility.ObjectBuilder
 
ObjectCloner - Class in edu.georgetown.gucs.utility
Makes a deep copy of an Object using serialization
of(int) - Static method in class edu.georgetown.gucs.utility.AntlrBitSet
Creates an AntlrBitSet with the given element
or(AntlrBitSet) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Performs OR operation
orInPlace(AntlrBitSet) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Performs OR operation
outputFields(String) - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
Sets the digest output to generate for this fingerprinter
OutsideInFileTokenizer - Class in edu.georgetown.gucs.tokenizers
Uses Oracle's OutsideIn technology to extract text from a file.
OutsideInFileTokenizer() - Constructor for class edu.georgetown.gucs.tokenizers.OutsideInFileTokenizer
Constructor that sets the token creation mode to split based on whitespace
OutsideInFileTokenizer(String) - Constructor for class edu.georgetown.gucs.tokenizers.OutsideInFileTokenizer
Constructor that sets the token creation mode.

P

Pair<A,B> - Class in edu.georgetown.gucs.utility
Creates an ordered pair
Pair(A, B) - Constructor for class edu.georgetown.gucs.utility.Pair
Constructor that sets both objects in this pair
PorterTokenizer - Class in edu.georgetown.gucs.tokenizers
Uses English stemming to change tokens into their root form
PorterTokenizer() - Constructor for class edu.georgetown.gucs.tokenizers.PorterTokenizer
 
position_iterator() - Method in class edu.georgetown.gucs.tokenizers.Tokenizer
Provides an iterator over the positions of elements in this tokenizer
positions - Variable in class edu.georgetown.gucs.tokenizers.Tokenizer
The list of integer pairs corresponding to the position of each token
PRECISION - Static variable in class edu.georgetown.gucs.experiment.Global
 
printExperimentResults() - Method in class edu.georgetown.gucs.experiment.Experiment
Prints the results of all trials, grouped by parameters and displayed by mangler
printFiles() - Method in class edu.georgetown.gucs.utility.FileLister
Prints each file from the initial list of files to standard output
printMatch() - Method in class edu.georgetown.gucs.matcher.ScoreFingerprints
Depending on the matcher, prints either the boolean match or integer similarity of these two fingerprints
printScores() - Method in class edu.georgetown.gucs.matcher.CompareDirectory
Prints all the files and their similarity scores; if a minimum score is provided, only outputs those files with scores greater than or equal to the minimum
printTokenizerNames() - Method in class edu.georgetown.gucs.tokenizers.TokenizerList
Prints the ordered list of the tokenizer names in this list
printTokens(Iterator<String>) - Static method in class edu.georgetown.gucs.tokenizers.Tokenizer
Prints each token to the system output stream using the UTF-8 charset
printTrialResults() - Method in class edu.georgetown.gucs.experiment.Experiment
Calculates the results of all trials, grouped by parameters and displayed by mangler
progress - Static variable in class edu.georgetown.gucs.experiment.Global
 
ProgressMeter - Class in edu.georgetown.gucs.utility
Prints progress information, including elapsed time and status message
ProgressMeter() - Constructor for class edu.georgetown.gucs.utility.ProgressMeter
Constructor that sets current time for later comparisons for elapsed time

R

readFile(String) - Method in class edu.georgetown.gucs.tokenizers.FileTokenizer
Splits a text document into tokens.
readFile(String) - Method in class edu.georgetown.gucs.tokenizers.GzippedFileTokenizer
Splits a gzipped text document into tokens.
readFile(String) - Method in class edu.georgetown.gucs.tokenizers.OutsideInFileTokenizer
Uses JNI to call Oracle's Outside In API to extract text from a file and splits that text into tokens
RECALL - Static variable in class edu.georgetown.gucs.experiment.Global
 
recursiveLoadDirectory(String) - Method in class edu.georgetown.gucs.utility.FileLister
Recursively loads all the files from the given directory into the initial list of files
remove(int) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Removes the given element from this set
RemoveNumericTokensTokenizer - Class in edu.georgetown.gucs.tokenizers
Eliminates tokens that are numbers
RemoveNumericTokensTokenizer() - Constructor for class edu.georgetown.gucs.tokenizers.RemoveNumericTokensTokenizer
Constructor that sets the Pattern to remove numeric tokens
RemoveTokensWithNumbersTokenizer - Class in edu.georgetown.gucs.tokenizers
Eliminates tokens containing numbers
RemoveTokensWithNumbersTokenizer() - Constructor for class edu.georgetown.gucs.tokenizers.RemoveTokensWithNumbersTokenizer
Constructor that sets the Pattern to remove tokens containing numbers
retrieveTokenizerList(List<String>) - Method in class edu.georgetown.gucs.tokenizers.TokenizerListManager
Finds the tokenizerList that contains the provided list of tokenizers; if no such tokenizerList exists then it creates a new list with these tokenizers
run() - Method in class edu.georgetown.gucs.dictionary.DictionaryWorker
Adds a file to an existing dictionary using the given tokenizers
run() - Method in class edu.georgetown.gucs.experiment.Trial
Runs the trial
runExperiment() - Method in class edu.georgetown.gucs.experiment.Experiment
Runs this experiment
runTrial() - Method in class edu.georgetown.gucs.experiment.Trial
Runs this trial
runTrials() - Method in class edu.georgetown.gucs.experiment.Experiment
Runs trials for each setting

S

saveDictionary(String) - Method in class edu.georgetown.gucs.dictionary.Dictionary
Writes this dictionary as a serialized object.
saveDictionaryXML(String) - Method in class edu.georgetown.gucs.dictionary.Dictionary
Writes this dictionary as a serialized XML object.
saveTokenizers(int, List<String>) - Method in class edu.georgetown.gucs.experiment.DBInterface
Adds a list of tokenizers to an experiment
ScoreFingerprints - Class in edu.georgetown.gucs.matcher
Compares two XML fingerprints and provides a score for their degree of similarity.
ScoreFingerprints(String, String, String) - Constructor for class edu.georgetown.gucs.matcher.ScoreFingerprints
Constructor that creates a matcher object and two Base64 encoded fingerprint string objects
sdtext - package sdtext
 
serialVersionUID - Static variable in class edu.georgetown.gucs.dictionary.Dictionary
 
serialVersionUID - Static variable in class edu.georgetown.gucs.dictionary.DictionaryEntry
 
set(A, B) - Method in class edu.georgetown.gucs.utility.Pair
Sets both objects in this pair
setA(A) - Method in class edu.georgetown.gucs.utility.Pair
Sets the first object in this pair
setB(B) - Method in class edu.georgetown.gucs.utility.Pair
Sets the second object in this pair
setCreatingProgram() - Method in class edu.georgetown.gucs.dictionary.Dictionary
Determines the name of the program that created this dictionary
setCreatingProgram(String) - Method in class edu.georgetown.gucs.dictionary.Dictionary
Sets the name of the program that created this dictionary
setCreator(String) - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
Sets the creator (person/organization) of this fingerprinter.
setDictionary(Dictionary) - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
Sets the dictionary for this fingerprinter.
setDictionary(String) - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
Sets the dictionary for this fingerprinter.
setLanguage(String) - Method in class edu.georgetown.gucs.dictionary.Dictionary
Sets the language of this dictionary
setMangler(String) - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
Passes the specified mangler settings to the mangler for this fingerprinter.
setMangler(String, Dictionary) - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
Passes the specified mangler settings and a set of tokens to the mangler for this fingerprinter.
setManglerRNG(Random) - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
Sets a random number generator for the mangler to allow for repeatability.
setManglerRNG(Random) - Method in class edu.georgetown.gucs.tokenizers.TokenizerList
Sets the random number generator to use with a FileManglerTokenizer object in this list
setMinimumScore(int) - Method in class edu.georgetown.gucs.matcher.FingerprintMatcher
Sets the minimum score for these two fingerprints to be considered a match.
setMode(String) - Method in class edu.georgetown.gucs.tokenizers.ChineseFileTokenizer
Sets the token creation mode.
setMode(String) - Method in class edu.georgetown.gucs.tokenizers.FileTokenizer
Sets the token creation mode.
setOutput(boolean, boolean, boolean) - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
Specifies which information to display in this fingerprinter's digest XML output
setPosition(int) - Method in class edu.georgetown.gucs.dictionary.DictionaryEntry
Sets the position of the token in this dictionary.
setRandomNumberGenerator(Random) - Method in class edu.georgetown.gucs.dictionary.Dictionary
Sets a random number generator to allow for repeatability.
setRNG(Random) - Method in class edu.georgetown.gucs.tokenizers.FileManglerTokenizer
Sets the random number generator to use with the manglers that are set in this tokenizer
setTerse() - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
Sets this fingerprinter's digest XML output to only display file and fingerprint information
setTokenizers(TokenizerList) - Method in class edu.georgetown.gucs.dictionary.Dictionary
Loads the tokenizers to use for this dictionary from a TokenizerList object.
setTokenizers(Vector<String>) - Method in class edu.georgetown.gucs.tokenizers.TokenizeFile
Sets the tokenizers to use on this document
setTokenizersByName(List<String>) - Method in class edu.georgetown.gucs.dictionary.Dictionary
Sets the tokenizers used by this dictionary; only works if no tokenizers have been set yet or if the given list contains the same tokenizers as those already set
setTrueMatches(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Sets all the matches for each fingerprint ID in this database
setVerbose() - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
Sets this fingerprinter's digest XML output to display all available information
showDataSource - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
 
showDictionary() - Method in class edu.georgetown.gucs.dictionary.Dictionary
Prints information about this dictionary, including all tokens and their statistics.
showDictionary - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
 
ShowDictionaryStatistics - Class in edu.georgetown.gucs.dictionary
Prints statistics for a Dictionary: name; language; number of documents; number of tokens; whether the dictionary has been trimmed (including the IDF range, if trimmed); minimum and maximum IDF; list of tokenizers
ShowDictionaryStatistics(String) - Constructor for class edu.georgetown.gucs.dictionary.ShowDictionaryStatistics
Constructor that specifies the dictionary and its language
ShowDictionaryTokens - Class in edu.georgetown.gucs.dictionary
Prints all the tokens and their frequencies, IDFs, and normalized IDFs for a Dictionary
ShowDictionaryTokens(String) - Constructor for class edu.georgetown.gucs.dictionary.ShowDictionaryTokens
Constructor that specifies the dictionary and its language
showDigest - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
 
showSettings() - Method in class edu.georgetown.gucs.experiment.Experiment
Prints out the properties for this experiment
showSettings() - Method in class edu.georgetown.gucs.tokenizers.FileManglerTokenizer
Prints mangler settings to standard output
showStatistics() - Method in class edu.georgetown.gucs.dictionary.Dictionary
Prints this dictionary's statistics, including: name; language; number of documents; number of tokens; whether this dictionary has been trimmed (including the IDF range, if trimmed); minimum and maximum IDF; list of tokenizers
showTokenizers() - Method in class edu.georgetown.gucs.dictionary.Dictionary
Prints the names of the tokenizers used for this dictionary
showTokens() - Method in class edu.georgetown.gucs.dictionary.Dictionary
Prints each token in this dictionary with its frequency, IDF, and normalized IDF
size() - Method in class edu.georgetown.gucs.utility.AntlrBitSet
TODO
split_tokenize(Iterator<String>, Iterator<Pair<Integer, Integer>>) - Method in class edu.georgetown.gucs.tokenizers.TokenizerList
Applies each tokenizer from this list, in order, on the tokens provided by a Splitter object; no tokenizers that are able to read from a file should be present in this list
start(String) - Method in class edu.georgetown.gucs.utility.ProgressMeter
Prints the status message and the current time for starting this progressMeter
startBulkFileAdd() - Method in class edu.georgetown.gucs.experiment.DBInterface
Drops all the indexes on the file table of this database
startBulkFileMatchAdd() - Method in class edu.georgetown.gucs.experiment.DBInterface
Loads all the existing matches into a cache of file matches
startBulkFingerprintInsertion() - Method in class edu.georgetown.gucs.experiment.DBInterface
Drops the indexes on the fingerprint_file_id, fingerprint_trial_id, and fingerprint_mangler_id tables
startBulkFingerprintMatchInsertion(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
Drops the indexes on the fingerprint_match tables
startNewRun() - Method in class edu.georgetown.gucs.experiment.DBInterface
Starts a new experiment run
startTraining(List<String>) - Method in class edu.georgetown.gucs.dictionary.Dictionary
Starts training, which permits tokens to be added to this dictionary.
Statistics - Class in edu.georgetown.gucs.experiment
Provides values for computing confidence intervals
Statistics() - Constructor for class edu.georgetown.gucs.experiment.Statistics
 
stop(String) - Method in class edu.georgetown.gucs.utility.ProgressMeter
Prints the status message, time elapsed since starting this progressMeter, and the current time for stopping this progressMeter
StopWordRemoverTokenizer - Class in edu.georgetown.gucs.tokenizers
Eliminates tokens specified in the given stop words document; assumes this document contains lowercase, non-porterized English words
StopWordRemoverTokenizer(String) - Constructor for class edu.georgetown.gucs.tokenizers.StopWordRemoverTokenizer
Constructor that loads the stop words to use when the tokenize method is called
StripMarkupTokenizer - Class in edu.georgetown.gucs.tokenizers
Eliminates tokens nested inside markup language tags; assumes that tokens have been split by line rather than using whitespace
StripMarkupTokenizer() - Constructor for class edu.georgetown.gucs.tokenizers.StripMarkupTokenizer
 
StripMarkupTokenizer(String) - Constructor for class edu.georgetown.gucs.tokenizers.StripMarkupTokenizer
Constructor that specifies whether to keep tokens nested inside <script> tags; the default is to eliminate these tokens
StripPunctuationTokenizer - Class in edu.georgetown.gucs.tokenizers
Separates tokens based on punctuation and removes punctuation from tokens
StripPunctuationTokenizer() - Constructor for class edu.georgetown.gucs.tokenizers.StripPunctuationTokenizer
 
submitTask(Runnable) - Method in class edu.georgetown.gucs.utility.BoundedExecutor
Calls run method of given instance; blocks until a thread is available
subset(AntlrBitSet) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Returns whether this bitset is contained within the given antlrBitSet
subtractInPlace(AntlrBitSet) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Subtracts the elements of the given antlrBitSet from this bitset in-place (turn off all bits of this bitset that are in the given antlrBitSet)
systemID - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
 

T

targetFile - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
 
threadingOff() - Method in class edu.georgetown.gucs.dictionary.Dictionary
Turns off threading for creating this dictionary
threadingOn() - Method in class edu.georgetown.gucs.dictionary.Dictionary
Turns on threading for creating this dictionary.
threadingOn(int) - Method in class edu.georgetown.gucs.dictionary.Dictionary
Turns on threading for creating this dictionary.
toArray() - Method in class edu.georgetown.gucs.utility.AntlrBitSet
TODO
toBytes() - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Converts this bitset to an array of bytes
toHex() - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Converts this bitset to a hexadecimal string
tokenize(String) - Method in class edu.georgetown.gucs.tokenizers.ArabicFileTokenizer
Splits the document into tokens.
tokenize(String) - Method in class edu.georgetown.gucs.tokenizers.ChineseFileTokenizer
Splits a Chinese or mixed Chinese-English document into tokens based on the token creation mode.
tokenize(Iterator<String>) - Method in class edu.georgetown.gucs.tokenizers.EnronEmailTokenizer
Eliminates tokens related to attachments in the Enron data set
tokenize(Iterator<String>, Iterator<Pair<Integer, Integer>>) - Method in class edu.georgetown.gucs.tokenizers.EnronEmailTokenizer
Eliminates tokens related to attachments in the Enron data set
tokenize(Iterator<String>) - Method in class edu.georgetown.gucs.tokenizers.EnronStripMailHeaderTokenizer
Eliminates tokens from headers in the Enron data set
tokenize(Iterator<String>, Iterator<Pair<Integer, Integer>>) - Method in class edu.georgetown.gucs.tokenizers.EnronStripMailHeaderTokenizer
Eliminates tokens from headers in the Enron data set
tokenize(Iterator<String>) - Method in class edu.georgetown.gucs.tokenizers.FileManglerTokenizer
Alters or eliminates certain tokens based on the given mangler settings
tokenize(String) - Method in class edu.georgetown.gucs.tokenizers.FileTokenizer
Splits the document into tokens.
tokenize(String) - Method in class edu.georgetown.gucs.tokenizers.GzippedFileTokenizer
Splits a gzipped text document into tokens.
tokenize(Iterator<String>) - Method in class edu.georgetown.gucs.tokenizers.MaximumLengthTokenizer
Eliminates tokens that are longer than the length specified in the constructor
tokenize(Iterator<String>, Iterator<Pair<Integer, Integer>>) - Method in class edu.georgetown.gucs.tokenizers.MaximumLengthTokenizer
Eliminates tokens that are longer than the length specified in the constructor
tokenize(Iterator<String>) - Method in class edu.georgetown.gucs.tokenizers.MinimumLengthTokenizer
Eliminates tokens that are shorter than the length specified in the constructor
tokenize(Iterator<String>, Iterator<Pair<Integer, Integer>>) - Method in class edu.georgetown.gucs.tokenizers.MinimumLengthTokenizer
Eliminates tokens that are shorter than the length specified in the constructor
tokenize(String) - Method in class edu.georgetown.gucs.tokenizers.OutsideInFileTokenizer
 
tokenize(Iterator<String>) - Method in class edu.georgetown.gucs.tokenizers.PorterTokenizer
Changes English language tokens into their root form
tokenize(Iterator<String>, Iterator<Pair<Integer, Integer>>) - Method in class edu.georgetown.gucs.tokenizers.PorterTokenizer
Changes English language tokens into their root form
tokenize(Iterator<String>) - Method in class edu.georgetown.gucs.tokenizers.RemoveNumericTokensTokenizer
Eliminates tokens that are numbers
tokenize(Iterator<String>, Iterator<Pair<Integer, Integer>>) - Method in class edu.georgetown.gucs.tokenizers.RemoveNumericTokensTokenizer
Eliminates tokens that are numbers
tokenize(Iterator<String>) - Method in class edu.georgetown.gucs.tokenizers.RemoveTokensWithNumbersTokenizer
Eliminates tokens containing numbers
tokenize(Iterator<String>, Iterator<Pair<Integer, Integer>>) - Method in class edu.georgetown.gucs.tokenizers.RemoveTokensWithNumbersTokenizer
Eliminates tokens containing numbers
tokenize(Iterator<String>) - Method in class edu.georgetown.gucs.tokenizers.StopWordRemoverTokenizer
Eliminates tokens specified in the given stop words document
tokenize(Iterator<String>, Iterator<Pair<Integer, Integer>>) - Method in class edu.georgetown.gucs.tokenizers.StopWordRemoverTokenizer
Eliminates tokens specified in the given stop words document
tokenize(Iterator<String>) - Method in class edu.georgetown.gucs.tokenizers.StripMarkupTokenizer
Eliminates tokens nested inside markup language tags; assumes that tokens have been split by line rather than using whitespace
tokenize(Iterator<String>, Iterator<Pair<Integer, Integer>>) - Method in class edu.georgetown.gucs.tokenizers.StripMarkupTokenizer
Eliminates tokens nested inside markup language tags; assumes that tokens have been split by line rather than using whitespace
tokenize(Iterator<String>) - Method in class edu.georgetown.gucs.tokenizers.StripPunctuationTokenizer
Separates tokens based on punctuation and removes punctuation from tokens
tokenize(Iterator<String>, Iterator<Pair<Integer, Integer>>) - Method in class edu.georgetown.gucs.tokenizers.StripPunctuationTokenizer
Separates tokens based on punctuation and removes punctuation from tokens
tokenize(String, Vector<String>) - Method in class edu.georgetown.gucs.tokenizers.TokenizeFile
Splits the given file into tokens and alters or eliminates those tokens based on the vector of tokenizers.
tokenize(Iterator<String>) - Method in class edu.georgetown.gucs.tokenizers.Tokenizer
Alters or eliminates certain tokens.
tokenize(Iterator<String>, Iterator<Pair<Integer, Integer>>) - Method in class edu.georgetown.gucs.tokenizers.Tokenizer
Alters or eliminates certain tokens when using Splitter.
tokenize(String) - Method in class edu.georgetown.gucs.tokenizers.Tokenizer
Splits the document into tokens.
tokenize() - Method in class edu.georgetown.gucs.tokenizers.TokenizerList
Applies each tokenizer from this list, in order, on the tokens; the first tokenizer must be able to read from a file and create the list of tokens
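Applying tokenizers in order, as described above, amounts to feeding the output of each stage into the next. The sketch below is an illustration only: it models each stage as a `Function` over token iterators rather than using the package's Tokenizer classes, and the names `TokenizerChainSketch`, `minLength`, `removeNumeric`, and `applyAll` are hypothetical. The first step here splits an in-memory string on whitespace instead of reading a file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

public class TokenizerChainSketch {
    // Hypothetical stage: drops tokens shorter than `min` characters.
    static Function<Iterator<String>, Iterator<String>> minLength(int min) {
        return in -> {
            List<String> out = new ArrayList<>();
            while (in.hasNext()) {
                String t = in.next();
                if (t.length() >= min) out.add(t);
            }
            return out.iterator();
        };
    }

    // Hypothetical stage: drops tokens made up only of digits.
    static Function<Iterator<String>, Iterator<String>> removeNumeric() {
        return in -> {
            List<String> out = new ArrayList<>();
            while (in.hasNext()) {
                String t = in.next();
                if (!t.matches("\\d+")) out.add(t);
            }
            return out.iterator();
        };
    }

    // Splits the text on whitespace, then applies each stage in list order.
    static List<String> applyAll(String text,
            List<Function<Iterator<String>, Iterator<String>>> stages) {
        Iterator<String> tokens = Arrays.asList(text.trim().split("\\s+")).iterator();
        for (Function<Iterator<String>, Iterator<String>> stage : stages) {
            tokens = stage.apply(tokens);
        }
        List<String> result = new ArrayList<>();
        tokens.forEachRemaining(result::add);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(applyAll("a 2024 quick fox 7",
                Arrays.asList(minLength(2), removeNumeric()))); // prints [quick, fox]
    }
}
```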
TokenizeFile - Class in edu.georgetown.gucs.tokenizers
Tests tokenizers using a configuration file that contains a list of Tokenizer objects to use on a document and prints the results of the tokenization
TokenizeFile() - Constructor for class edu.georgetown.gucs.tokenizers.TokenizeFile
Constructor that initializes an empty string vector of tokenizers to use
TokenizeFile(String) - Constructor for class edu.georgetown.gucs.tokenizers.TokenizeFile
Constructor that sets the tokenizers from a configuration file to use on this document.
tokenizeFile(String) - Method in class edu.georgetown.gucs.tokenizers.TokenizerList
Applies each tokenizer from this list, in order, on the file; the first tokenizer must be able to read from a file and the tokenizers must already be instantiated
tokenizeFile(File) - Method in class edu.georgetown.gucs.tokenizers.TokenizerList
Applies each tokenizer from this list, in order, on the file; the first tokenizer must be able to read from a file and the tokenizers must already be instantiated
Tokenizer - Class in edu.georgetown.gucs.tokenizers
Splits a document into tokens (either by line or by word) or alters the tokens in various ways.
Tokenizer() - Constructor for class edu.georgetown.gucs.tokenizers.Tokenizer
 
TokenizerList - Class in edu.georgetown.gucs.tokenizers
An ordered list of Tokenizer objects to split a document into tokens and alter the tokens in various ways.
TokenizerList() - Constructor for class edu.georgetown.gucs.tokenizers.TokenizerList
Constructor that initializes an empty list of tokenizers
TokenizerList(List<String>) - Constructor for class edu.georgetown.gucs.tokenizers.TokenizerList
Constructor that takes a list of the tokenizer names
TokenizerListManager - Class in edu.georgetown.gucs.tokenizers
Ensures that TokenizerList objects are only created once.
TokenizerListManager() - Constructor for class edu.georgetown.gucs.tokenizers.TokenizerListManager
Constructor that initializes a HashMap holding all TokenizerList objects
tokenizers - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
 
tokenVector - Variable in class edu.georgetown.gucs.tokenizers.Tokenizer
The list of each token in order of its appearance
toPackedArray() - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Creates and returns a copy of this antlrBitSet's bitset
toString() - Method in class edu.georgetown.gucs.experiment.TrialParameters
Provides a string representation of the parameters needed for results comparisons
toString() - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Transforms a bit set into a comma-separated string by formatting each element as an integer
toString(String) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Transforms a bit set into a string by formatting each element as an integer
toString(String, List<String>) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Creates a string representation where instead of integer elements, the ith element of a list of strings is displayed
toString() - Method in class edu.georgetown.gucs.utility.Pair
Provides a string representation of the pair in the form (a, b)
toStringOfHalfWords() - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Dumps a comma-separated list of the words making up the bit set, splitting each 64-bit number into two more manageable 32-bit numbers; generates a comma-separated list of C++-like unsigned long constants
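Splitting a 64-bit word into two 32-bit halves can be sketched as follows. This is an illustration under stated assumptions, not the AntlrBitSet implementation: the class `HalfWordsSketch` is hypothetical, and both the ordering (low half first) and the `UL` suffix are guesses at what "C++-like unsigned long constants" might look like.

```java
public class HalfWordsSketch {
    // Hypothetical helper: formats each 64-bit word as two unsigned 32-bit values.
    // Assumption: the low half is written before the high half.
    static String toStringOfHalfWords(long[] words) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) sb.append(", ");
            long low = words[i] & 0xFFFFFFFFL; // lower 32 bits, kept non-negative
            long high = words[i] >>> 32;        // upper 32 bits via unsigned shift
            sb.append(low).append("UL, ").append(high).append("UL");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toStringOfHalfWords(new long[] {0x0000000100000002L})); // prints 2UL, 1UL
    }
}
```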
toStringOfWords() - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Dumps a comma-separated list of the words making up the bit set; generates a comma-separated list of Java-like long int constants
totalDocuments - Variable in class edu.georgetown.gucs.dictionary.Dictionary
 
Trial - Class in edu.georgetown.gucs.experiment
Uses a given set of parameters to create a dictionary and fingerprints of a set of documents.
Trial(int, boolean, DBInterface, String, FileLister, File, String, long, double, double, double, double, List<String>, String, String, String, int) - Constructor for class edu.georgetown.gucs.experiment.Trial
Constructor that sets default values for this trial
TrialParameters - Class in edu.georgetown.gucs.experiment
Provides the trial parameters that are needed for results comparisons; includes dataset, sample parameter, dictionary parameter, min and max IDF, fingerprinter name, matcher name and minimum score, tokenizers and manglers
TrialParameters(String, String, String, String, String, String, String, String, String, String) - Constructor for class edu.georgetown.gucs.experiment.TrialParameters
Constructor that specifies all the parameters for a trial
trim(String) - Method in class edu.georgetown.gucs.dictionary.TrimDictionary
 
trimByIDF(double, double) - Method in class edu.georgetown.gucs.dictionary.Dictionary
Trims this dictionary by removing any token that is outside a range of normalized IDFs.
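The index does not say how IDFs are normalized, so the following is only a sketch of range-based trimming under an assumed definition: IDF is taken as log(N / df) and divided by log(N), so that values fall between 0 and 1. The class `TrimByIdfSketch`, its method names, and the map-based dictionary are all hypothetical and do not reflect the Dictionary class.

```java
import java.util.HashMap;
import java.util.Map;

public class TrimByIdfSketch {
    // Assumed normalization: log(N / df) / log(N), which is 0 for a token in every
    // document and 1 for a token in exactly one document.
    static double normalizedIdf(int docFreq, int totalDocs) {
        if (totalDocs <= 1) return 0.0; // avoid dividing by log(1) = 0
        return Math.log((double) totalDocs / docFreq) / Math.log(totalDocs);
    }

    // Returns a copy of the map with only the tokens whose normalized IDF lies in [min, max].
    static Map<String, Integer> trimByIdf(Map<String, Integer> docFreqs, int totalDocs,
                                          double min, double max) {
        Map<String, Integer> kept = new HashMap<>();
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            double idf = normalizedIdf(e.getValue(), totalDocs);
            if (idf >= min && idf <= max) kept.put(e.getKey(), e.getValue());
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> df = new HashMap<>();
        df.put("the", 100);  // appears in every document
        df.put("mid", 10);
        df.put("rare", 1);   // appears in one document
        System.out.println(trimByIdf(df, 100, 0.2, 0.8).keySet()); // prints [mid]
    }
}
```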
TrimDictionary - Class in edu.georgetown.gucs.dictionary
Creates a Dictionary containing tokens with IDFs within a specified range
TrimDictionary(double, double, String) - Constructor for class edu.georgetown.gucs.dictionary.TrimDictionary
Constructor that specifies the IDF range and dictionary to trim
TRUE_NEGATIVE - Static variable in class edu.georgetown.gucs.experiment.Global
 
TRUE_POSITIVE - Static variable in class edu.georgetown.gucs.experiment.Global
 

U

unknownTokens - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
 
update(String) - Method in class edu.georgetown.gucs.utility.ProgressMeter
Prints the status message and the time elapsed since this ProgressMeter was started

V

version - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
 
volume - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
 

X

xorInPlace(AntlrBitSet) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
Performs an in-place XOR operation with the given AntlrBitSet