- add(int) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
-
Adds an element to this set using the OR operator; the set grows as necessary to accommodate it
- addDictionaryToTrial(int, int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Adds this dictionary to this trial in this database
- addDocument(Iterator<String>) - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Adds a document to this dictionary.
- addFile(File) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Provides the ID for the file if it already exists in this database, or creates a new ID and adds it
- addFileMatch(File, File, double, double) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Checks if files match and if so adds files and their normalized min and max IDF to the file_match table
- addManglerToTrial(int, int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Adds the mangler ID to the trial in this database
- addNullManglerToTrial(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Adds a "null" mangler to position 1 of the trial_mangler table
- addToFingerprintMatch(int, int, int, boolean, boolean, int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Adds fingerprint match information if the match is not already recorded in the fingerprint_match table of this
database
- addTokenizer(String) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Adds a tokenizer to the tokenizer table in this database
- addTokenizer(String) - Method in class edu.georgetown.gucs.tokenizers.TokenizerList
-
Adds a tokenizer to the end of this list of tokenizers.
- addTokenizerNames(List<String>) - Method in class edu.georgetown.gucs.tokenizers.TokenizerList
-
Adds tokenizers to the end of this list of tokenizers
- addTokenizers(String) - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Reads an XML file that specifies which tokenizers to use for this dictionary
- addTokenizers(String) - Method in class edu.georgetown.gucs.tokenizers.TokenizeFile
-
Adds tokenizers from an XML file to the list of tokenizers to use on this document
- addTrialResults(int, int, int, int, int, int, double, double, double) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Adds the results from a trial to this database
- addTrialToExperiment(int, int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Adds the trial ID to the experiment in this database
- and(AntlrBitSet) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
-
Performs AND operation
- andInPlace(AntlrBitSet) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
-
Performs AND operation
- AntlrBitSet - Class in edu.georgetown.gucs.utility
-
Creates a BitSet to replace java.util.BitSet.
- AntlrBitSet() - Constructor for class edu.georgetown.gucs.utility.AntlrBitSet
-
Constructs a bitset of size one word (64 bits)
- AntlrBitSet(long[]) - Constructor for class edu.georgetown.gucs.utility.AntlrBitSet
-
Constructs a clone of the static array of longs
- AntlrBitSet(int) - Constructor for class edu.georgetown.gucs.utility.AntlrBitSet
-
Constructs a bitset of given size
- AntlrBitSet(byte[]) - Constructor for class edu.georgetown.gucs.utility.AntlrBitSet
-
Constructor that converts and sets this bitset
- ArabicFileTokenizer - Class in edu.georgetown.gucs.tokenizers
-
Splits contents of an Arabic text file into tokens using Apache Lucene's ArabicAnalyzer.
- ArabicFileTokenizer() - Constructor for class edu.georgetown.gucs.tokenizers.ArabicFileTokenizer
-
- asHex(byte[]) - Static method in class edu.georgetown.gucs.utility.Converter
-
Converts bytes to a hex string
- calculateTrialResults() - Method in class edu.georgetown.gucs.experiment.Trial
-
Computes the precision, recall, and f-score for each mangler
- changeRandomSeed(int, long) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Updates this database with the random seed to use for this trial
- checkIndexing(String, int, int) - Method in class edu.georgetown.gucs.tokenizers.FileTokenizer
-
Prints to the screen a portion of the given text document
- checkIndexing(String, int, int) - Method in class edu.georgetown.gucs.tokenizers.GzippedFileTokenizer
-
Prints to the screen a portion of the given gzipped text document
- checkIndexing(String, int, int) - Method in class edu.georgetown.gucs.tokenizers.OutsideInFileTokenizer
-
Prints to the screen a portion of the given document
- ChineseFileTokenizer - Class in edu.georgetown.gucs.tokenizers
-
Splits contents of a Chinese or Chinese-English text file into tokens using Apache Lucene's
ChineseAnalyzer or SmartChineseAnalyzer.
- ChineseFileTokenizer() - Constructor for class edu.georgetown.gucs.tokenizers.ChineseFileTokenizer
-
Constructor that sets the token creation mode to split by smart tokenization using probabilistic word segmentation
- ChineseFileTokenizer(String) - Constructor for class edu.georgetown.gucs.tokenizers.ChineseFileTokenizer
-
Constructor that sets the token creation mode.
- clear() - Method in class edu.georgetown.gucs.tokenizers.TokenizerList
-
Clears this list of tokenizers
- clear() - Method in class edu.georgetown.gucs.utility.AntlrBitSet
-
Clears all elements in this bitset
- clear(int) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
-
Clears the given element in this bitset
- clearExperimentResults(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Removes the experiment results from experiment_results table
- clone(Dictionary) - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Creates a deep copy of the given dictionary object and stores it in this dictionary
- clone() - Method in class edu.georgetown.gucs.utility.AntlrBitSet
-
Creates and returns a copy of this AntlrBitSet
- close() - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Closes this database
- CommandLine - Class in sdtext
-
Provides a command line interface to run the different programs within SDTEXT (Similarity Digest Text)
- CommandLine() - Constructor for class sdtext.CommandLine
-
- CompareDirectory - Class in edu.georgetown.gucs.matcher
-
Outputs a list of files and their scores, comparing a given fingerprint to a directory of files.
- CompareDirectory(String, String, String) - Constructor for class edu.georgetown.gucs.matcher.CompareDirectory
-
Constructor that sets the matcher, fingerprint file and directory to use for this comparison
- CompareDirectory(String, String, String, int) - Constructor for class edu.georgetown.gucs.matcher.CompareDirectory
-
Constructor that sets the matcher, fingerprint file, directory and minimum score to use for this comparison
- CompareDirectory(String, String, String, int, String) - Constructor for class edu.georgetown.gucs.matcher.CompareDirectory
-
Constructor that sets the matcher, fingerprint file, directory, minimum score and dictionary to use for this
comparison
- computeFingerprint(String) - Method in class edu.georgetown.gucs.fingerprinter.BitVectorFingerprinter
-
Computes a byte array fingerprint indicating the presence or absence of each token in this dictionary
- computeFingerprint(String) - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
Computes the fingerprint of this document as a byte array; indicates the presence or absence of each token in this
dictionary
- computeFingerprints() - Method in class edu.georgetown.gucs.matcher.CompareDirectory
-
Computes a fingerprint for each file in this directory and compares it to this fingerprint using this matcher; if the
matcher returns boolean values, true values are given a score of 99 and false values are given a score of 0
- computeFingerprintXML(String) - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
Computes the fingerprint of a document as a Base64 encoded string; indicates the presence or absence of each token
in this dictionary
- containsToken(String) - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Determines if this dictionary contains a particular token
- Converter - Class in edu.georgetown.gucs.utility
-
Converts between bits, bytes, and hex
- Converter() - Constructor for class edu.georgetown.gucs.utility.Converter
-
- CosineSimilarityFingerprintMatcher - Class in edu.georgetown.gucs.matcher
-
- CosineSimilarityFingerprintMatcher() - Constructor for class edu.georgetown.gucs.matcher.CosineSimilarityFingerprintMatcher
-
Constructor that sets the minimum score to use for matching two fingerprints to zero
- CosineSimilarityFingerprintMatcher(int) - Constructor for class edu.georgetown.gucs.matcher.CosineSimilarityFingerprintMatcher
-
Constructor that sets the minimum score to use for matching two fingerprints
- CosineSimilarityFingerprintMatcher(String) - Constructor for class edu.georgetown.gucs.matcher.CosineSimilarityFingerprintMatcher
-
Constructor that sets the minimum score to use for matching two fingerprints.
- countResults(int, int, int, boolean, boolean) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Counts the number of entries with the given manglers and matching results in the fingerprint_match table for a
trial
- createFromFileLister(FileLister) - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Creates this dictionary using an optional number or percent of the files in the given fileLister; if no count or
percent is specified, uses the entire file list to create the dictionary
- createFromFileLister(FileLister, double) - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Creates this dictionary using an optional number or percent of the files in the given fileLister
- createNewListFromCount(int) - Method in class edu.georgetown.gucs.utility.FileLister
-
Creates a new list of randomly chosen files of the given size from this fileLister
- createNewListFromPercent(double) - Method in class edu.georgetown.gucs.utility.FileLister
-
Creates a new list of randomly chosen files that is a percent of the files from this fileLister
- creatingProgram - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
- creator - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
- FALSE_NEGATIVE - Static variable in class edu.georgetown.gucs.experiment.Global
-
- FALSE_POSITIVE - Static variable in class edu.georgetown.gucs.experiment.Global
-
- fileExists(File) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Checks whether the file exists in the database
- FileLister - Class in edu.georgetown.gucs.utility
-
Creates a list of files in a directory and allows iteration over all and parts of this list
- FileLister() - Constructor for class edu.georgetown.gucs.utility.FileLister
-
Default constructor
- FileLister(String) - Constructor for class edu.georgetown.gucs.utility.FileLister
-
Constructor that loads all the files from the given directory
- FileManglerTokenizer - Class in edu.georgetown.gucs.tokenizers
-
Manipulates files as they are read.
- FileManglerTokenizer() - Constructor for class edu.georgetown.gucs.tokenizers.FileManglerTokenizer
-
Constructor that initializes the random number generator and clears the mangler settings
- FileManglerTokenizer(String) - Constructor for class edu.georgetown.gucs.tokenizers.FileManglerTokenizer
-
Constructor that sets the mangler settings
- FileTokenizer - Class in edu.georgetown.gucs.tokenizers
-
Splits contents of a text file into tokens based on whitespace or by line.
- FileTokenizer() - Constructor for class edu.georgetown.gucs.tokenizers.FileTokenizer
-
Constructor that sets the token creation mode to split based on whitespace
- FileTokenizer(String) - Constructor for class edu.georgetown.gucs.tokenizers.FileTokenizer
-
Constructor that sets the token creation mode.
- finalize() - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Closes this database
- Fingerprinter - Class in edu.georgetown.gucs.fingerprinter
-
This is the base class for all fingerprinters.
- Fingerprinter() - Constructor for class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
Constructor that generates the fingerprint name, version, unique identifier (GUID), system identifier, and creating
program.
- Fingerprinter(Dictionary) - Constructor for class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
Constructor that loads a dictionary and its tokenizers.
- Fingerprinter(String) - Constructor for class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
Constructor that loads a dictionary and its tokenizers.
- FingerprintMatcher - Class in edu.georgetown.gucs.matcher
-
This is the base class for all fingerprint matchers.
- FingerprintMatcher() - Constructor for class edu.georgetown.gucs.matcher.FingerprintMatcher
-
Constructor that sets the minimum score to use for matching two fingerprints to zero
- FingerprintMatcher(int) - Constructor for class edu.georgetown.gucs.matcher.FingerprintMatcher
-
Constructor that sets the minimum score to use for matching two fingerprints
- fingerprintName - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
- finishBulkFileAdd() - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Rebuilds indexes for file_full_path table
- finishBulkFileMatchAdd() - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Rebuilds indexes for file_match table
- finishBulkFingerprintInsertion() - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Rebuilds indexes for fingerprint_file_id, fingerprint_trial_id, and fingerprint_mangler_id tables
- finishBulkFingerprintMatchInsertion() - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Rebuilds indexes on the fingerprint_match tables
- finishRun(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Finishes an experiment run
- FSCORE - Static variable in class edu.georgetown.gucs.experiment.Global
-
- generateCreatingProgram() - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
Determines the program that created this fingerprinter
- generateXML(String, String, String) - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
Generates an XML file that contains this fingerprinter's digest
- getA() - Method in class edu.georgetown.gucs.utility.Pair
-
Provides the first object in this pair
- getAllFingerprintsFromTrial(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Provides all the fingerprints and fingerprint IDs for a trial
- getAllMatchingFileIDs(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Provides all the file IDs of files that have matches
- getB() - Method in class edu.georgetown.gucs.utility.Pair
-
Provides the second object in this pair
- getBase64Fingerprint() - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
Gives this fingerprinter's fingerprint in Base64 encoding
- getCreatingProgram() - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Provides the program that created this dictionary
- getCreatingProgram() - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
Gives the program that created this fingerprinter
- getCreation() - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Provides the original creation date of this dictionary
- getDataset() - Method in class edu.georgetown.gucs.experiment.TrialParameters
-
Provides the dataset for this trial
- getDictionary() - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
Gives this fingerprinter's dictionary
- getDictionaryFilename() - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Provides the filename of this dictionary, if this dictionary was loaded from or saved to a file
- getDictionaryFilename() - Method in class edu.georgetown.gucs.fingerprinter.ExtractDictionary
-
Provides the filename of this extracted dictionary
- getDictionaryName() - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Provides the name of this dictionary
- getDictionaryParameter() - Method in class edu.georgetown.gucs.experiment.TrialParameters
-
Provides the dictionary count or percent for this trial
- getDictionarySize() - Method in class edu.georgetown.gucs.experiment.Trial
-
Provides the size of the dictionary used
- getDirectory() - Method in class edu.georgetown.gucs.utility.FileLister
-
Provides the path to the directory used for the initial list of files
- getDocumentCount() - Method in class edu.georgetown.gucs.dictionary.DictionaryEntry
-
Provides the number of documents this token appears in
- getExperimentDatasetPath(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Provides the path of the dataset used in an experiment
- getExperimentDescription(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Provides the description for an experiment
- getExperimentDictionarySetting(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Provides the count or percent size of files used for the dictionary in an experiment
- getExperimentFingerprinterName(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Provides the fingerprinter name used in an experiment
- getExperimentLanguage(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Provides the language used in an experiment
- getExperimentManglerSettings(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Provides the mangler names and settings used in an experiment
- getExperimentMatcherName(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Provides the matcher name used in an experiment
- getExperimentMatcherParameter(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Provides the minimum score used for the matcher in an experiment
- getExperimentMaximumIDF(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Provides the maximum normalized IDF for an experiment's dictionary
- getExperimentMinimumIDF(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Provides the minimum normalized IDF for an experiment's dictionary
- getExperimentSampleSetting(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Provides the count or percent size of the sample of files used for an experiment
- getExperimentTokenizers(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Provides the tokenizer names used in an experiment
- getExperimentTrialCount(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Provides the number of trials in an experiment
- getFileID(File) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Provides the ID of the file or 0 if it is not in this database
- getFileTypes() - Method in class edu.georgetown.gucs.utility.FileLister
-
Provides the counts for each file extension that exists in this list of files
- getFingerprint(String) - Method in class edu.georgetown.gucs.matcher.ScoreFingerprints
-
Extracts the Base64 encoded fingerprint string from the given XML fingerprint digest file
- getFingerprinter() - Method in class edu.georgetown.gucs.experiment.TrialParameters
-
Provides the fingerprinter for this trial
- getFingerprintName() - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
Gives this fingerprinter's name.
- getFrequency(String) - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Provides the number of times the given token appeared when creating this dictionary
- getFrequencyCount() - Method in class edu.georgetown.gucs.dictionary.DictionaryEntry
-
Provides the frequency count for this entry
- getGUID() - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Provides the unique identifier for this dictionary
- getIDF(String) - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Provides the inverse document frequency (IDF) of the given token in this dictionary
- getIDF() - Method in class edu.georgetown.gucs.dictionary.DictionaryEntry
-
Provides the IDF of this token
- getIDF(int) - Method in class edu.georgetown.gucs.dictionary.DictionaryEntry
-
Computes the IDF of this token using the totalDocuments
- getIterator() - Method in class edu.georgetown.gucs.tokenizers.TokenizerList
-
Provides the string iterator over this document's tokens from the last tokenizer from this list
- getLanguage() - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Provides the language of this dictionary
- getManglerID(String) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Provides the ID for the existing mangler setting that matches or creates a new one
- getManglerResults() - Method in class edu.georgetown.gucs.experiment.Trial
-
Provides the results for each mangler; sleeps until the trial is finished
- getManglers() - Method in class edu.georgetown.gucs.experiment.TrialParameters
-
Provides the manglers for this trial
- getManglersForTrial(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Provides all the mangler IDs for a trial
- getMatcher() - Method in class edu.georgetown.gucs.experiment.TrialParameters
-
Provides the matcher for this trial
- getMatcherName() - Method in class edu.georgetown.gucs.matcher.FingerprintMatcher
-
Provides the name of the matcher used to determine if the two fingerprints have matching documents
- getMatcherScore() - Method in class edu.georgetown.gucs.experiment.TrialParameters
-
Provides the matcher minimum score for this trial
- getMatchingFileCount() - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Provides the number of file matches
- getMaxIDF() - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Provides the largest IDF value in this dictionary
- getMaxIDF() - Method in class edu.georgetown.gucs.experiment.TrialParameters
-
Provides the maximum IDF for this trial
- getMean(List<Double>) - Static method in class edu.georgetown.gucs.experiment.Statistics
-
Provides the mean for the given values
- getMinIDF() - Method in class edu.georgetown.gucs.experiment.TrialParameters
-
Provides the minimum IDF for this trial
- getMinimumScore() - Method in class edu.georgetown.gucs.matcher.FingerprintMatcher
-
Provides the minimum score for these two fingerprints to be considered a match.
- getMode() - Method in class edu.georgetown.gucs.tokenizers.FileTokenizer
-
Provides the token creation mode
- getNames() - Method in class edu.georgetown.gucs.tokenizers.TokenizerList
-
Provides the ordered list of the tokenizer names in this list
- getNewListIterator() - Method in class edu.georgetown.gucs.utility.FileLister
-
Provides an iterator over the updated list of files; need to call newCountIterator or newPercentIterator to update the list of files
- getNormalizedIDF(String) - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Provides the normalized inverse document frequency (IDF) of the given token in this dictionary.
- getNumberOfFiles() - Method in class edu.georgetown.gucs.utility.FileLister
-
Provides the size of the initial list of files
- getOriginalFingerprintsFromTrial(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Provides the fingerprints and fingerprint IDs with null manglers for a trial
- getPosition(String) - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Provides the position of the given token in this dictionary
- getPosition() - Method in class edu.georgetown.gucs.dictionary.DictionaryEntry
-
Provides the position of the token in this dictionary.
- getPositionsVector() - Method in class edu.georgetown.gucs.tokenizers.Tokenizer
-
Provides the list of integer pairs corresponding to the position of each token
- getRandomToken() - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Gives a random token from this dictionary
- getSampleParameter() - Method in class edu.georgetown.gucs.experiment.TrialParameters
-
Provides the sample count or percent for this trial
- getScore(byte[], byte[]) - Method in class edu.georgetown.gucs.matcher.CosineSimilarityFingerprintMatcher
-
Determines the cosine similarity score for these two fingerprints
- getScore(byte[], byte[]) - Method in class edu.georgetown.gucs.matcher.ExactFingerprintMatcher
-
Determines a similarity score for these two fingerprints
- getScore(byte[], byte[]) - Method in class edu.georgetown.gucs.matcher.FingerprintMatcher
-
Determines a similarity score for these two fingerprints.
- getScoreXML(String, String) - Method in class edu.georgetown.gucs.matcher.FingerprintMatcher
-
Determines a similarity score for these two fingerprints.
- getSize() - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Provides the number of tokens in this dictionary
- getSource() - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Provides the directory used for this dictionary
- getStandardDeviation(List<Double>) - Static method in class edu.georgetown.gucs.experiment.Statistics
-
Provides the standard deviation for the given values
- getStandardDeviation(List<Double>, double) - Static method in class edu.georgetown.gucs.experiment.Statistics
-
Provides the standard deviation for the given values
- getSystemID() - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Provides the identifier for the system that this dictionary was created on
- getToken() - Method in class edu.georgetown.gucs.dictionary.DictionaryEntry
-
Provides the name of the token
- getTokenizerID(String) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Provides the ID of the tokenizer or 0 if it is not in the tokenizer table of this database
- getTokenizerNames() - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Provides the names of the tokenizers used to create this dictionary
- getTokenizers() - Method in class edu.georgetown.gucs.experiment.TrialParameters
-
Provides the tokenizers for this trial
- getTokenizers() - Method in class edu.georgetown.gucs.tokenizers.TokenizeFile
-
Provides the list of tokenizers to use on this document
- getTokenizers() - Method in class edu.georgetown.gucs.tokenizers.TokenizerList
-
Provides the ordered list of the tokenizers in this list
- getTokens() - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Provides a vector of the tokens in this dictionary
- getTokenVector() - Method in class edu.georgetown.gucs.tokenizers.Tokenizer
-
Provides the list of each token in order of its appearance
- getTotalDocuments() - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Provides the number of documents processed for this dictionary
- getTrialFileIDs(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Provides all the file IDs and whether they have matches in a trial
- getTrialParameters() - Method in class edu.georgetown.gucs.experiment.Trial
-
Provides the parameters for this trial that are needed for results comparisons
- getTValueNinetyFive(int) - Static method in class edu.georgetown.gucs.experiment.Statistics
-
Provides a value for the confidence interval
- getVersion() - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Provides the version number of this dictionary.
- Global - Class in edu.georgetown.gucs.experiment
-
Global information for an Experiment
- Global() - Constructor for class edu.georgetown.gucs.experiment.Global
-
- growToInclude(int) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
-
Grows the set to a larger number of bits
- GUID - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
- GzippedFileTokenizer - Class in edu.georgetown.gucs.tokenizers
-
Splits contents of a gzipped text file into tokens based on whitespace or by line.
- GzippedFileTokenizer() - Constructor for class edu.georgetown.gucs.tokenizers.GzippedFileTokenizer
-
Constructor that sets the token creation mode to split based on whitespace
- GzippedFileTokenizer(String) - Constructor for class edu.georgetown.gucs.tokenizers.GzippedFileTokenizer
-
Constructor that sets the token creation mode.
- main(String[]) - Static method in class edu.georgetown.gucs.dictionary.Dictionary
-
Creates a dictionary from the specified directory using the provided tokenizers
- main(String[]) - Static method in class edu.georgetown.gucs.dictionary.ShowDictionaryStatistics
-
Prints statistics for a dictionary, including name, language, number of documents, number of tokens, minimum and
maximum IDF, list of tokenizers, and whether the dictionary has been trimmed
- main(String[]) - Static method in class edu.georgetown.gucs.dictionary.ShowDictionaryTokens
-
Prints a dictionary's tokens and their frequencies, IDFs, and normalized IDFs
- main(String[]) - Static method in class edu.georgetown.gucs.dictionary.TrimDictionary
-
Trims a dictionary by removing any token that is outside a range of normalized IDFs
- main(String[]) - Static method in class edu.georgetown.gucs.experiment.Experiment
-
Runs experiments using multiple runs with different settings and generates results
- main(String[]) - Static method in class edu.georgetown.gucs.fingerprinter.BitVectorFingerprinter
-
Creates a bit vector fingerprint representing the presence or absence of terms in a particular dictionary
- main(String[]) - Static method in class edu.georgetown.gucs.fingerprinter.ExtractDictionary
-
Extracts a dictionary from a fingerprint digest and writes it to a new dictionary file
- main(String[]) - Static method in class edu.georgetown.gucs.matcher.CompareDirectory
-
Outputs a list of files and their scores, comparing a given fingerprint to a directory of files.
- main(String[]) - Static method in class edu.georgetown.gucs.matcher.CosineSimilarityFingerprintMatcher
-
- main(String[]) - Static method in class edu.georgetown.gucs.matcher.ExactFingerprintMatcher
-
- main(String[]) - Static method in class edu.georgetown.gucs.matcher.ScoreFingerprints
-
Compares two XML fingerprints and provides a score for their degree of similarity.
- main(String[]) - Static method in class edu.georgetown.gucs.tokenizers.ArabicFileTokenizer
-
Tokenizes an Arabic text file and prints the resulting tokens to the screen
- main(String[]) - Static method in class edu.georgetown.gucs.tokenizers.ChineseFileTokenizer
-
Tokenizes a Chinese text file both into single characters and using a probabilistic model, and prints both sets of
resulting tokens to the screen
- main(String[]) - Static method in class edu.georgetown.gucs.tokenizers.FileManglerTokenizer
-
- main(String[]) - Static method in class edu.georgetown.gucs.tokenizers.FileTokenizer
-
Tokenizes a text file and prints the resulting tokens to the screen
- main(String[]) - Static method in class edu.georgetown.gucs.tokenizers.GzippedFileTokenizer
-
Tokenizes a gzipped text file and prints the resulting tokens to the screen
- main(String[]) - Static method in class edu.georgetown.gucs.tokenizers.OutsideInFileTokenizer
-
Tokenizes a file using OutsideIn and prints the resulting tokens to the screen
- main(String[]) - Static method in class edu.georgetown.gucs.tokenizers.TokenizeFile
-
Tokenizes a text file and prints the resulting tokens to the screen
- main(String[]) - Static method in class sdtext.CommandLine
-
Provides a command line interface to run the different SDTEXT programs
- makeDictionary(String, String, String) - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Creates a dictionary from the given directory; reads an XML file that specifies which tokenizers to use and outputs
the dictionary to a file
- match(byte[], byte[]) - Method in class edu.georgetown.gucs.matcher.CosineSimilarityFingerprintMatcher
-
Determines that the two fingerprints are matching if their similarity score is at or above the minimum score for
this fingerprinter
- match(byte[], byte[]) - Method in class edu.georgetown.gucs.matcher.ExactFingerprintMatcher
-
Determines that the two fingerprints are matching if their byte arrays are equal
- match(byte[], byte[]) - Method in class edu.georgetown.gucs.matcher.FingerprintMatcher
-
Determines if the two fingerprints' documents match.
- match(String, String) - Method in class edu.georgetown.gucs.matcher.FingerprintMatcher
-
Determines if these two fingerprints' documents match.
- matcherName - Variable in class edu.georgetown.gucs.matcher.FingerprintMatcher
-
- matches(File) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Matches the given file against the files in this database
- MaximumLengthTokenizer - Class in edu.georgetown.gucs.tokenizers
-
Eliminates tokens that are longer than a given length
- MaximumLengthTokenizer(String) - Constructor for class edu.georgetown.gucs.tokenizers.MaximumLengthTokenizer
-
Constructor that sets the maximum token length to be considered
- member(int) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
-
TODO
- mergeDictionary(Map<String, DictionaryEntry>) - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Merges the given Map of tokens to dictionaryEntry with this dictionary
- minimum_score - Variable in class edu.georgetown.gucs.matcher.FingerprintMatcher
-
- MinimumLengthTokenizer - Class in edu.georgetown.gucs.tokenizers
-
Eliminates tokens that are shorter than a given length
- MinimumLengthTokenizer(String) - Constructor for class edu.georgetown.gucs.tokenizers.MinimumLengthTokenizer
-
Constructor that sets the minimum token length to be considered
- MOD_MASK - Static variable in class edu.georgetown.gucs.utility.AntlrBitSet
-
A precomputed mod mask.
- mode - Variable in class edu.georgetown.gucs.tokenizers.FileTokenizer
-
The string specifying to split tokens based on whitespace ("tokens") or by line ("lines")
- saveDictionary(String) - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Writes this dictionary as a serialized object.
- saveDictionaryXML(String) - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Writes this dictionary as a serialized XML object.
- saveTokenizers(int, List<String>) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Adds a list of tokenizers to an experiment
- ScoreFingerprints - Class in edu.georgetown.gucs.matcher
-
Compares two XML fingerprints and provides a score for their degree of similarity.
- ScoreFingerprints(String, String, String) - Constructor for class edu.georgetown.gucs.matcher.ScoreFingerprints
-
Constructor that creates a matcher object and two Base64 encoded fingerprint string objects
- sdtext - package sdtext
-
- serialVersionUID - Static variable in class edu.georgetown.gucs.dictionary.Dictionary
-
- serialVersionUID - Static variable in class edu.georgetown.gucs.dictionary.DictionaryEntry
-
- set(A, B) - Method in class edu.georgetown.gucs.utility.Pair
-
Sets both objects in this pair
- setA(A) - Method in class edu.georgetown.gucs.utility.Pair
-
Sets the first object in this pair
- setB(B) - Method in class edu.georgetown.gucs.utility.Pair
-
Sets the second object in this pair
- setCreatingProgram() - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Determines the name of the program that created this dictionary
- setCreatingProgram(String) - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Sets the name of the program that created this dictionary
- setCreator(String) - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
Sets the creator (person/organization) of this fingerprinter.
- setDictionary(Dictionary) - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
Sets the dictionary for this fingerprinter.
- setDictionary(String) - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
Sets the dictionary for this fingerprinter.
- setLanguage(String) - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Sets the language of this dictionary
- setMangler(String) - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
Passes the specified mangler settings to the mangler for this fingerprinter.
- setMangler(String, Dictionary) - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
Passes the specified mangler settings and a set of tokens to the mangler for this fingerprinter.
- setManglerRNG(Random) - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
Sets a random number generator for the mangler to allow for repeatability.
- setManglerRNG(Random) - Method in class edu.georgetown.gucs.tokenizers.TokenizerList
-
Sets the random number generator to use with a FileManglerTokenizer object in this list
- setMinimumScore(int) - Method in class edu.georgetown.gucs.matcher.FingerprintMatcher
-
Sets the minimum score for these two fingerprints to be considered a match.
- setMode(String) - Method in class edu.georgetown.gucs.tokenizers.ChineseFileTokenizer
-
Sets the token creation mode.
- setMode(String) - Method in class edu.georgetown.gucs.tokenizers.FileTokenizer
-
Sets the token creation mode.
- setOutput(boolean, boolean, boolean) - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
Specifies which information to display in this fingerprinter's digest XML output
- setPosition(int) - Method in class edu.georgetown.gucs.dictionary.DictionaryEntry
-
Sets the position of the token in this dictionary.
- setRandomNumberGenerator(Random) - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Sets a random number generator to allow for repeatability.
- setRNG(Random) - Method in class edu.georgetown.gucs.tokenizers.FileManglerTokenizer
-
Sets the random number generator to use with the manglers that are set in this tokenizer
- setTerse() - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
Sets this fingerprinter's digest XML output to only display file and fingerprint information
- setTokenizers(TokenizerList) - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Loads the tokenizers to use for this dictionary from a TokenizerList object.
- setTokenizers(Vector<String>) - Method in class edu.georgetown.gucs.tokenizers.TokenizeFile
-
Sets the tokenizers to use on this document
- setTokenizersByName(List<String>) - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Sets the tokenizers used by this dictionary; Only works if no tokenizers have already been set or if the given list
contains the same tokenizers as those that have already been set
- setTrueMatches(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Sets all the matches for each fingerprint ID in this database
- setVerbose() - Method in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
Sets this fingerprinter's digest XML output to display all available information
- showDataSource - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
- showDictionary() - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Prints information about this dictionary, including all tokens and their statistics.
- showDictionary - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
- ShowDictionaryStatistics - Class in edu.georgetown.gucs.dictionary
-
Prints statistics for a Dictionary:
Name
Language
Number of documents
Number of tokens
If the dictionary has been trimmed (including the IDF range, if trimmed)
Minimum and maximum IDF
List of tokenizers
- ShowDictionaryStatistics(String) - Constructor for class edu.georgetown.gucs.dictionary.ShowDictionaryStatistics
-
Constructor that specifies the dictionary and its language
- ShowDictionaryTokens - Class in edu.georgetown.gucs.dictionary
-
Prints all the tokens and their frequencies, IDFs, and normalized IDFs for a Dictionary
- ShowDictionaryTokens(String) - Constructor for class edu.georgetown.gucs.dictionary.ShowDictionaryTokens
-
Constructor that specifies the dictionary and its language
- showDigest - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
- showSettings() - Method in class edu.georgetown.gucs.experiment.Experiment
-
Prints out the properties for this experiment
- showSettings() - Method in class edu.georgetown.gucs.tokenizers.FileManglerTokenizer
-
Prints mangler settings to standard output
- showStatistics() - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Prints this dictionary's statistics, including:
Name
Language
Number of documents
Number of tokens
If this dictionary has been trimmed (including the IDF range, if trimmed)
Minimum and maximum IDF
List of tokenizers
- showTokenizers() - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Prints the names of the tokenizers used for this dictionary
- showTokens() - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Prints each token in this dictionary with its frequency, IDF, and normalized IDF
- size() - Method in class edu.georgetown.gucs.utility.AntlrBitSet
-
TODO
- split_tokenize(Iterator<String>, Iterator<Pair<Integer, Integer>>) - Method in class edu.georgetown.gucs.tokenizers.TokenizerList
-
Applies each tokenizer from this list, in order, on the tokens provided by a Splitter object; no tokenizers that
are able to read from a file should be present in this list
- start(String) - Method in class edu.georgetown.gucs.utility.ProgressMeter
-
Prints the status message and the current time for starting this progressMeter
- startBulkFileAdd() - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Drops all the indexes on the file table of this database
- startBulkFileMatchAdd() - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Loads all the existing matches into a cache of file matches
- startBulkFingerprintInsertion() - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Drops the indexes on the fingerprint_file_id, fingerprint_trial_id, and fingerprint_mangler_id tables
- startBulkFingerprintMatchInsertion(int) - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Drops the indexes on the fingerprint_match tables
- startNewRun() - Method in class edu.georgetown.gucs.experiment.DBInterface
-
Starts a new experiment run
- startTraining(List<String>) - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Starts training, which permits tokens to be added to this dictionary.
- Statistics - Class in edu.georgetown.gucs.experiment
-
Provides values for confidence interval
- Statistics() - Constructor for class edu.georgetown.gucs.experiment.Statistics
-
- stop(String) - Method in class edu.georgetown.gucs.utility.ProgressMeter
-
Prints the status message, time elapsed since starting this progressMeter, and the current time for stopping this
progressMeter
- StopWordRemoverTokenizer - Class in edu.georgetown.gucs.tokenizers
-
Eliminates tokens specified in the given stop words document; assumes this document contains lower case,
non-porterized English words
- StopWordRemoverTokenizer(String) - Constructor for class edu.georgetown.gucs.tokenizers.StopWordRemoverTokenizer
-
Constructor that loads the stop words to use when the tokenize method is called
- StripMarkupTokenizer - Class in edu.georgetown.gucs.tokenizers
-
Eliminates tokens nested inside markup language tags; assumes that tokens have been split by line rather than using
whitespace
- StripMarkupTokenizer() - Constructor for class edu.georgetown.gucs.tokenizers.StripMarkupTokenizer
-
- StripMarkupTokenizer(String) - Constructor for class edu.georgetown.gucs.tokenizers.StripMarkupTokenizer
-
Constructor that specifies whether to keep tokens nested inside <script> tags; the default is to eliminate these
tokens
- StripPunctuationTokenizer - Class in edu.georgetown.gucs.tokenizers
-
Separates tokens based on punctuation and removes punctuation from tokens
- StripPunctuationTokenizer() - Constructor for class edu.georgetown.gucs.tokenizers.StripPunctuationTokenizer
-
- submitTask(Runnable) - Method in class edu.georgetown.gucs.utility.BoundedExecutor
-
Calls the run method of the given instance; blocks until a thread is available
- subset(AntlrBitSet) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
-
Returns whether this bitset is contained within a
- subtractInPlace(AntlrBitSet) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
-
Subtracts the elements of the given antlrBitSet from this bitset in-place (turn off all bits of this bitset that
are in the given antlrBitSet)
- systemID - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
- targetFile - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
- threadingOff() - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Turns off threading for creating this dictionary
- threadingOn() - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Turns on threading for creating this dictionary.
- threadingOn(int) - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Turns on threading for creating this dictionary.
- toArray() - Method in class edu.georgetown.gucs.utility.AntlrBitSet
-
TODO
- toBytes() - Method in class edu.georgetown.gucs.utility.AntlrBitSet
-
Converts this bitset to an array of bytes
- toHex() - Method in class edu.georgetown.gucs.utility.AntlrBitSet
-
Converts this bitset to a hexadecimal string
- tokenize(String) - Method in class edu.georgetown.gucs.tokenizers.ArabicFileTokenizer
-
Splits the document into tokens.
- tokenize(String) - Method in class edu.georgetown.gucs.tokenizers.ChineseFileTokenizer
-
Splits a Chinese or mixed Chinese-English document into tokens based on the token creation mode.
- tokenize(Iterator<String>) - Method in class edu.georgetown.gucs.tokenizers.EnronEmailTokenizer
-
Eliminates tokens related to attachments in the Enron data set
- tokenize(Iterator<String>, Iterator<Pair<Integer, Integer>>) - Method in class edu.georgetown.gucs.tokenizers.EnronEmailTokenizer
-
Eliminates tokens related to attachments in the Enron data set
- tokenize(Iterator<String>) - Method in class edu.georgetown.gucs.tokenizers.EnronStripMailHeaderTokenizer
-
Eliminates tokens from headers in the Enron data set
- tokenize(Iterator<String>, Iterator<Pair<Integer, Integer>>) - Method in class edu.georgetown.gucs.tokenizers.EnronStripMailHeaderTokenizer
-
Eliminates tokens from headers in the Enron data set
- tokenize(Iterator<String>) - Method in class edu.georgetown.gucs.tokenizers.FileManglerTokenizer
-
Alters or eliminates certain tokens based on the given mangler settings
- tokenize(String) - Method in class edu.georgetown.gucs.tokenizers.FileTokenizer
-
Splits the document into tokens.
- tokenize(String) - Method in class edu.georgetown.gucs.tokenizers.GzippedFileTokenizer
-
Splits a gzipped text document into tokens.
- tokenize(Iterator<String>) - Method in class edu.georgetown.gucs.tokenizers.MaximumLengthTokenizer
-
Eliminates tokens that are longer than the length specified in the constructor
- tokenize(Iterator<String>, Iterator<Pair<Integer, Integer>>) - Method in class edu.georgetown.gucs.tokenizers.MaximumLengthTokenizer
-
Eliminates tokens that are longer than the length specified in the constructor
- tokenize(Iterator<String>) - Method in class edu.georgetown.gucs.tokenizers.MinimumLengthTokenizer
-
Eliminates tokens that are shorter than the length specified in the constructor
- tokenize(Iterator<String>, Iterator<Pair<Integer, Integer>>) - Method in class edu.georgetown.gucs.tokenizers.MinimumLengthTokenizer
-
Eliminates tokens that are shorter than the length specified in the constructor
- tokenize(String) - Method in class edu.georgetown.gucs.tokenizers.OutsideInFileTokenizer
-
- tokenize(Iterator<String>) - Method in class edu.georgetown.gucs.tokenizers.PorterTokenizer
-
Changes English language tokens into their root form
- tokenize(Iterator<String>, Iterator<Pair<Integer, Integer>>) - Method in class edu.georgetown.gucs.tokenizers.PorterTokenizer
-
Changes English language tokens into their root form
- tokenize(Iterator<String>) - Method in class edu.georgetown.gucs.tokenizers.RemoveNumericTokensTokenizer
-
Eliminates tokens that are numbers
- tokenize(Iterator<String>, Iterator<Pair<Integer, Integer>>) - Method in class edu.georgetown.gucs.tokenizers.RemoveNumericTokensTokenizer
-
Eliminates tokens that are numbers
- tokenize(Iterator<String>) - Method in class edu.georgetown.gucs.tokenizers.RemoveTokensWithNumbersTokenizer
-
Eliminates tokens containing numbers
- tokenize(Iterator<String>, Iterator<Pair<Integer, Integer>>) - Method in class edu.georgetown.gucs.tokenizers.RemoveTokensWithNumbersTokenizer
-
Eliminates tokens containing numbers
- tokenize(Iterator<String>) - Method in class edu.georgetown.gucs.tokenizers.StopWordRemoverTokenizer
-
Eliminates tokens specified in the given stop words document
- tokenize(Iterator<String>, Iterator<Pair<Integer, Integer>>) - Method in class edu.georgetown.gucs.tokenizers.StopWordRemoverTokenizer
-
Eliminates tokens specified in the given stop words document
- tokenize(Iterator<String>) - Method in class edu.georgetown.gucs.tokenizers.StripMarkupTokenizer
-
Eliminates tokens nested inside markup language tags; assumes that tokens have been split by line rather than using
whitespace
- tokenize(Iterator<String>, Iterator<Pair<Integer, Integer>>) - Method in class edu.georgetown.gucs.tokenizers.StripMarkupTokenizer
-
Eliminates tokens nested inside markup language tags; assumes that tokens have been split by line rather than using
whitespace
- tokenize(Iterator<String>) - Method in class edu.georgetown.gucs.tokenizers.StripPunctuationTokenizer
-
Separates tokens based on punctuation and removes punctuation from tokens
- tokenize(Iterator<String>, Iterator<Pair<Integer, Integer>>) - Method in class edu.georgetown.gucs.tokenizers.StripPunctuationTokenizer
-
Separates tokens based on punctuation and removes punctuation from tokens
- tokenize(String, Vector<String>) - Method in class edu.georgetown.gucs.tokenizers.TokenizeFile
-
Splits the given file into tokens and alters or eliminates those tokens based on the vector of tokenizers.
- tokenize(Iterator<String>) - Method in class edu.georgetown.gucs.tokenizers.Tokenizer
-
Alters or eliminates certain tokens.
- tokenize(Iterator<String>, Iterator<Pair<Integer, Integer>>) - Method in class edu.georgetown.gucs.tokenizers.Tokenizer
-
Alters or eliminates certain tokens when using Splitter.
- tokenize(String) - Method in class edu.georgetown.gucs.tokenizers.Tokenizer
-
Splits the document into tokens.
- tokenize() - Method in class edu.georgetown.gucs.tokenizers.TokenizerList
-
Applies each tokenizer from this list, in order, on the tokens; the first tokenizer must be able to read from a
file and create the list of tokens
- TokenizeFile - Class in edu.georgetown.gucs.tokenizers
-
Tests tokenizers using a configuration file that contains a list of Tokenizer objects to use on a document and
prints the results of the tokenization
- TokenizeFile() - Constructor for class edu.georgetown.gucs.tokenizers.TokenizeFile
-
Constructor that initializes an empty string vector of tokenizers to use
- TokenizeFile(String) - Constructor for class edu.georgetown.gucs.tokenizers.TokenizeFile
-
Constructor that sets the tokenizers from a configuration file to use on this document.
- tokenizeFile(String) - Method in class edu.georgetown.gucs.tokenizers.TokenizerList
-
Applies each tokenizer from this list, in order, on the file; the first tokenizer must be able to read from a file
and the tokenizers must already be instantiated
- tokenizeFile(File) - Method in class edu.georgetown.gucs.tokenizers.TokenizerList
-
Applies each tokenizer from this list, in order, on the file; the first tokenizer must be able to read from a file
and the tokenizers must already be instantiated
- Tokenizer - Class in edu.georgetown.gucs.tokenizers
-
Splits a document into tokens (either by line or by word) or alters the tokens in various ways.
- Tokenizer() - Constructor for class edu.georgetown.gucs.tokenizers.Tokenizer
-
- TokenizerList - Class in edu.georgetown.gucs.tokenizers
-
An ordered list of Tokenizer objects to split a document into tokens and alter the tokens in various ways.
- TokenizerList() - Constructor for class edu.georgetown.gucs.tokenizers.TokenizerList
-
Constructor that initializes an empty list of tokenizers
- TokenizerList(List<String>) - Constructor for class edu.georgetown.gucs.tokenizers.TokenizerList
-
Constructor that takes a list of the tokenizer names
- TokenizerListManager - Class in edu.georgetown.gucs.tokenizers
-
- TokenizerListManager() - Constructor for class edu.georgetown.gucs.tokenizers.TokenizerListManager
-
Constructor that initializes a hashMap holding all tokenizerList objects
- tokenizers - Variable in class edu.georgetown.gucs.fingerprinter.Fingerprinter
-
- tokenVector - Variable in class edu.georgetown.gucs.tokenizers.Tokenizer
-
The list of each token in order of its appearance
- toPackedArray() - Method in class edu.georgetown.gucs.utility.AntlrBitSet
-
Creates and returns a copy of this antlrBitSet's bitset
- toString() - Method in class edu.georgetown.gucs.experiment.TrialParameters
-
Provides a string representation of the parameters needed for results comparisons
- toString() - Method in class edu.georgetown.gucs.utility.AntlrBitSet
-
Transforms this bit set into a comma-separated string by formatting each element as an integer
- toString(String) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
-
Transforms this bit set into a string by formatting each element as an integer
- toString(String, List<String>) - Method in class edu.georgetown.gucs.utility.AntlrBitSet
-
Creates a string representation where instead of integer elements, the ith element of a list of strings is
displayed
- toString() - Method in class edu.georgetown.gucs.utility.Pair
-
Provides a string representation of the pair in the form (a, b)
- toStringOfHalfWords() - Method in class edu.georgetown.gucs.utility.AntlrBitSet
-
Dumps a comma-separated list of the words making up the bit set; Splits each 64 bit number into two more manageable
32 bit numbers; Generates a comma-separated list of C++-like unsigned long constants
- toStringOfWords() - Method in class edu.georgetown.gucs.utility.AntlrBitSet
-
Dumps a comma-separated list of the words making up the bit set; Generates a comma-separated list of Java-like long
int constants
- totalDocuments - Variable in class edu.georgetown.gucs.dictionary.Dictionary
-
- Trial - Class in edu.georgetown.gucs.experiment
-
Uses a given set of parameters to create a dictionary and fingerprints of a set of documents.
- Trial(int, boolean, DBInterface, String, FileLister, File, String, long, double, double, double, double, List<String>, String, String, String, int) - Constructor for class edu.georgetown.gucs.experiment.Trial
-
Constructor that sets default values for this trial
- TrialParameters - Class in edu.georgetown.gucs.experiment
-
Provides the trial parameters that are needed for results comparisons; includes dataset, sample parameter, dictionary
parameter, min and max IDF, fingerprinter name, matcher name and minimum score, tokenizers and manglers
- TrialParameters(String, String, String, String, String, String, String, String, String, String) - Constructor for class edu.georgetown.gucs.experiment.TrialParameters
-
Constructor that specifies all the parameters for a trial
- trim(String) - Method in class edu.georgetown.gucs.dictionary.TrimDictionary
-
- trimByIDF(double, double) - Method in class edu.georgetown.gucs.dictionary.Dictionary
-
Trims this dictionary by removing any token that is outside a range of normalized IDFs.
- TrimDictionary - Class in edu.georgetown.gucs.dictionary
-
Creates a Dictionary containing tokens with IDFs within a specified range
- TrimDictionary(double, double, String) - Constructor for class edu.georgetown.gucs.dictionary.TrimDictionary
-
Constructor that specifies the IDF range and dictionary to trim
- TRUE_NEGATIVE - Static variable in class edu.georgetown.gucs.experiment.Global
-
- TRUE_POSITIVE - Static variable in class edu.georgetown.gucs.experiment.Global
-