Package | Description |
---|---|
edu.georgetown.gucs.dictionary | |
edu.georgetown.gucs.fingerprinter | |
edu.georgetown.gucs.tokenizers | |
Modifier and Type | Method and Description |
---|---|
Token | DictionaryEntry.getToken() Provides the token |
Modifier and Type | Method and Description |
---|---|
java.util.List<Token> | Dictionary.getTokens() Provides a list of the tokens in this dictionary |
java.util.Iterator<java.util.Map.Entry<Token,DictionaryEntry>> | Dictionary.iterator() Allows for operations on the tokens in this dictionary |
Modifier and Type | Method and Description |
---|---|
boolean | Dictionary.containsToken(Token token) Determines if this dictionary contains a particular token |
double | Dictionary.getNormalizedIDF(Token token) Provides the normalized inverse document frequency (IDF) of the given token in this dictionary |
int | Dictionary.getPosition(Token token) Provides the position of the given token in this dictionary |
Modifier and Type | Method and Description |
---|---|
protected void | Dictionary.mergeDictionary(java.util.Map<Token,DictionaryEntry> documentTokens) Merges the given map of Token to DictionaryEntry entries into this dictionary |
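The table above documents Dictionary.getNormalizedIDF but not its formula. As a hypothetical illustration only, a "normalized" IDF is commonly the classic log(N/df) scaled into [0, 1] by dividing by log(N); the class name, method, and formula below are assumptions for illustration, not the library's actual implementation:

```java
// Hypothetical sketch of an IDF weighting like the one Dictionary.getNormalizedIDF
// might return: classic log(N / df), scaled into [0, 1] by dividing by log(N).
public class NormalizedIdfSketch {
    // totalDocs: number of documents behind the dictionary;
    // docFreq: number of documents containing the token (assumed >= 1)
    public static double normalizedIdf(int totalDocs, int docFreq) {
        if (totalDocs <= 1) return 0.0;          // a single document carries no contrast
        double idf = Math.log((double) totalDocs / docFreq);
        return idf / Math.log(totalDocs);        // a token unique to one document scores 1.0
    }

    public static void main(String[] args) {
        // A token appearing in 1 of 100 documents is maximally distinctive.
        System.out.println(normalizedIdf(100, 1));   // 1.0
        // A token appearing in every document is not distinctive at all.
        System.out.println(normalizedIdf(100, 100)); // 0.0
    }
}
```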
Modifier and Type | Method and Description |
---|---|
java.util.List<Token> | Fingerprint.getTokenList() |
Modifier and Type | Method and Description |
---|---|
LongFastBloomFilter | Fingerprinter.addBloomFilter(java.util.List<Token> tokenList) Creates a Bloom filter from a list of tokens |
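LongFastBloomFilter itself is not documented on this page; the following self-contained sketch shows only the generic technique that Fingerprinter.addBloomFilter presumably relies on — hashing each token into a fixed-size bit array so membership can later be tested probabilistically. The sizes, hash scheme, and class name here are assumptions, not the library's:

```java
import java.util.BitSet;
import java.util.List;

// Minimal Bloom-filter sketch: add tokens by setting k hash-derived bits;
// membership tests may report false positives but never false negatives.
public class BloomSketch {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public BloomSketch(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive the i-th bit index from two hash values (Kirsch-Mitzenmacher scheme).
    private int index(String token, int i) {
        int h1 = token.hashCode();
        int h2 = h1 >>> 16 | h1 << 16;            // cheap second hash: 16-bit rotation
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String token) {
        for (int i = 0; i < hashes; i++) bits.set(index(token, i));
    }

    public boolean mightContain(String token) {
        for (int i = 0; i < hashes; i++) if (!bits.get(index(token, i))) return false;
        return true;
    }

    // Analogue of building a filter from a token list, as addBloomFilter does.
    public static BloomSketch fromTokens(List<String> tokens) {
        BloomSketch filter = new BloomSketch(1 << 16, 4);
        for (String t : tokens) filter.add(t);
        return filter;
    }

    public static void main(String[] args) {
        BloomSketch filter = fromTokens(List.of("alpha", "beta", "gamma"));
        System.out.println(filter.mightContain("alpha"));   // true
        System.out.println(filter.mightContain("omega"));   // false with very high probability
    }
}
```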
Constructor and Description |
---|
Fingerprint(java.lang.String fileName, byte[] fingerprint, java.util.List<Token> tokenList, Dictionary dict, int start_byte, int end_byte) Default constructor; fingerprints are byte arrays that contain tokenized dictionary information |
Fingerprint(java.lang.String fileName, byte[] fingerprint, java.util.List<Token> tokenList, int start_byte, int end_byte) Default constructor; fingerprints are byte arrays that contain tokenized dictionary information |
Fingerprint(java.lang.String fileName, java.lang.String sdhash, java.util.List<Token> tokenList, int start_byte, int end_byte) Default constructor; fingerprints are byte arrays that contain tokenized dictionary information |
Modifier and Type | Field and Description |
---|---|
protected java.util.Map<java.lang.String,java.util.List<Token>> | Tokenizer.tokenVectorMap The map from each splitter name to its corresponding list of tokens |
Modifier and Type | Method and Description |
---|---|
java.util.Map<java.lang.String,java.util.List<Token>> | Tokenizer.getTokenVectorMap() Provides the list of each token in order of its appearance |
java.util.Iterator<java.util.Map.Entry<java.lang.String,java.util.List<Token>>> | Tokenizer.iterator() |
java.util.List<Token> | ParseTokenizers.readFile() |
java.util.List<Token> | TikaFileTokenizer.readFile(java.lang.String fileName) Extends FileTokenizer to extract text from non-traditional text files |
protected java.util.List<Token> | OutsideInFileTokenizer.readFile(java.lang.String filename) Calls Oracle's Outside In API to extract text from a file and splits that text into tokens |
protected java.util.List<Token> | GzippedFileTokenizer.readFile(java.lang.String filename) Splits a gzipped text document into tokens |
protected java.util.List<Token> | FileTokenizer.readFile(java.lang.String filename) Splits a text document into tokens |
java.util.List<Token> | ParseTokenizers.readString() |
java.util.List<Token> | Tokenizer.tokenize(java.util.List<Token> tokenVector) Alters or eliminates certain tokens |
java.util.List<Token> | StripPunctuationTokenizer.tokenize(java.util.List<Token> tokens) Separates tokens based on punctuation and removes punctuation from tokens |
java.util.List<Token> | StripMarkupTokenizer.tokenize(java.util.List<Token> tokens) Eliminates tokens nested inside markup language tags; assumes that tokens have been split by line rather than by whitespace |
java.util.List<Token> | RemoveTokensWithNumbersTokenizer.tokenize(java.util.List<Token> tokens) Eliminates tokens containing numbers |
java.util.List<Token> | RemoveNumericTokensTokenizer.tokenize(java.util.List<Token> tokens) Eliminates tokens that are numbers |
java.util.List<Token> | PorterTokenizer.tokenize(java.util.List<Token> tokens) Changes English-language tokens into their root form |
java.util.List<Token> | MinimumLengthTokenizer.tokenize(java.util.List<Token> tokens) Eliminates tokens that are shorter than the length specified in the constructor |
java.util.List<Token> | MaximumLengthTokenizer.tokenize(java.util.List<Token> tokens) Eliminates tokens that are longer than the length specified in the constructor |
java.util.List<Token> | FileManglerTokenizer.tokenize(java.util.List<Token> tokens) Alters or eliminates certain tokens based on the given mangler settings |
java.util.List<Token> | Tokenizer.tokenize(java.lang.String filename) Splits the document into tokens |
java.util.List<Token> | MemoryTokenizer.tokenize(java.lang.String str) |
java.util.List<Token> | GzippedFileTokenizer.tokenize(java.lang.String filename) Splits a gzipped text document into tokens |
java.util.List<Token> | FileTokenizer.tokenize(java.lang.String filename) |
java.util.List<Token> | ChineseFileTokenizer.tokenize(java.lang.String filename) Splits a Chinese or mixed Chinese-English document into tokens based on the token creation mode |
java.util.List<Token> | ArabicFileTokenizer.tokenize(java.lang.String filename) Splits the document into tokens |
java.util.Map<java.lang.String,java.util.List<Token>> | Splitter.tokenizeEachEntry(java.util.List<Tokenizer> tokenizers) Tokenizes each splitter entry with the list of tokenizers; assumes no FileTokenizers are in the list |
java.util.Map<java.lang.String,java.util.List<Token>> | TokenizerList.tokenizeFile(java.lang.String filename) Applies each tokenizer from this list, in order, to the file; the first tokenizer must be able to read from a file and the tokenizers must already be instantiated |
java.util.Map<java.lang.String,java.util.List<Token>> | TokenizerList.tokenizeString(java.lang.String str) |
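The tokenize(List<Token>) methods above share one pattern: each takes a token list and returns a filtered or altered copy, and TokenizerList chains them in order. The sketch below illustrates that chaining with plain strings standing in for Token; the stage implementations mirror the documented behavior of MinimumLengthTokenizer and RemoveNumericTokensTokenizer but are assumptions, not the library's code:

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Self-contained sketch of the tokenize-chain pattern: each stage maps a token
// list to a new token list, and a runner applies the stages in order.
public class TokenizePipelineSketch {
    // Mirrors MinimumLengthTokenizer: drop tokens shorter than minLen.
    static UnaryOperator<List<String>> minimumLength(int minLen) {
        return tokens -> tokens.stream().filter(t -> t.length() >= minLen).toList();
    }

    // Mirrors RemoveNumericTokensTokenizer: drop tokens that are numbers.
    static UnaryOperator<List<String>> removeNumeric() {
        return tokens -> tokens.stream().filter(t -> !t.matches("\\d+")).toList();
    }

    // Mirrors TokenizerList applying each tokenizer from the list, in order.
    @SafeVarargs
    static List<String> run(List<String> tokens, UnaryOperator<List<String>>... stages) {
        for (UnaryOperator<List<String>> stage : stages) tokens = stage.apply(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        List<String> out = run(List.of("a", "2024", "token", "stream"),
                               minimumLength(2), removeNumeric());
        System.out.println(out); // [token, stream]
    }
}
```

Ordering matters in such a chain: running removeNumeric() before minimumLength(2) gives the same result here, but stages like punctuation stripping can create new short tokens that a later length filter would catch and an earlier one would miss.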
Modifier and Type | Method and Description |
---|---|
void | FileManglerTokenizer.changeToChar(java.util.Vector<Token> newTokenVector) |
void | FileManglerTokenizer.changeToken(java.util.Vector<Token> newTokenVector) |
void | FileManglerTokenizer.changeToPunc(java.util.Vector<Token> newTokenVector) |
void | FileManglerTokenizer.deleteChar(java.util.Vector<Token> newVectorToken) |
void | FileManglerTokenizer.deleteWhiteSpace(java.util.Vector<Token> newTokenVector) |
void | TokenizerList.enableMangler(java.lang.String settings, java.util.List<Token> tokens) Enables the mangler with the given settings in the FileManglerTokenizer object in this list |
void | FileManglerTokenizer.enableMangler(java.lang.String manglers, java.util.List<Token> tokenList) Enables the mangler with the given settings in this tokenizer |
void | FileManglerTokenizer.splitToken(java.util.Vector<Token> newTokenVector) |
void | FileManglerTokenizer.summaryReduce(java.util.Vector<Token> newtokenVector) |
java.util.List<Token> | Tokenizer.tokenize(java.util.List<Token> tokenVector) Alters or eliminates certain tokens |
java.util.List<Token> | StripPunctuationTokenizer.tokenize(java.util.List<Token> tokens) Separates tokens based on punctuation and removes punctuation from tokens |
java.util.List<Token> | StripMarkupTokenizer.tokenize(java.util.List<Token> tokens) Eliminates tokens nested inside markup language tags; assumes that tokens have been split by line rather than by whitespace |
java.util.List<Token> | RemoveTokensWithNumbersTokenizer.tokenize(java.util.List<Token> tokens) Eliminates tokens containing numbers |
java.util.List<Token> | RemoveNumericTokensTokenizer.tokenize(java.util.List<Token> tokens) Eliminates tokens that are numbers |
java.util.List<Token> | PorterTokenizer.tokenize(java.util.List<Token> tokens) Changes English-language tokens into their root form |
java.util.List<Token> | MinimumLengthTokenizer.tokenize(java.util.List<Token> tokens) Eliminates tokens that are shorter than the length specified in the constructor |
java.util.List<Token> | MaximumLengthTokenizer.tokenize(java.util.List<Token> tokens) Eliminates tokens that are longer than the length specified in the constructor |
java.util.List<Token> | FileManglerTokenizer.tokenize(java.util.List<Token> tokens) Alters or eliminates certain tokens based on the given mangler settings |
Constructor and Description |
---|
Splitter(java.lang.String splitter, java.util.List<Token> list) Constructor that takes a string indicating the type and percent for the splitter and a list of tokens to split |
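The Splitter constructor's "type and percent" string is not specified further on this page. Purely as an illustration of the percentage-split idea, the sketch below cuts a token list at a given percentage; the method name, the head-of-list semantics, and the integer-percent parameter are all assumptions, not the library's documented behavior:

```java
import java.util.List;

// Hypothetical sketch of a percentage split over a token list.
public class SplitterSketch {
    // Returns the first `percent` percent of the tokens (head split),
    // truncating toward zero when the cut does not fall on a whole token.
    public static List<String> head(List<String> tokens, int percent) {
        int cut = tokens.size() * percent / 100;
        return tokens.subList(0, cut);
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("one", "two", "three", "four");
        System.out.println(head(tokens, 50)); // [one, two]
    }
}
```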