Package | Description |
---|---|
edu.georgetown.gucs.tokenizers | |
Modifier and Type | Class and Description |
---|---|
class | ArabicFileTokenizer: Splits contents of an Arabic text file into tokens using Apache Lucene's ArabicAnalyzer. |
class | ChineseFileTokenizer: Splits contents of a Chinese or Chinese-English text file into tokens using Apache Lucene's ChineseAnalyzer or SmartChineseAnalyzer. |
class | FileManglerTokenizer: Manipulates files as they are read. |
class | FileTokenizer: Splits contents of a text file into tokens based on whitespace or by line. |
class | GzippedFileTokenizer: Splits contents of a gzipped text file into tokens based on whitespace or by line. |
class | MaximumLengthTokenizer: Eliminates tokens that are longer than a given length. |
class | MemoryTokenizer |
class | MinimumLengthTokenizer: Eliminates tokens that are shorter than a given length. |
class | OutsideInFileTokenizer: Uses Oracle's OutsideIn technology to extract text from a file. |
class | ParseTokenizers |
class | PorterTokenizer: Uses English stemming to change tokens into their root form. |
class | RemoveNumericTokensTokenizer: Eliminates tokens that are numbers. |
class | RemoveTokensWithNumbersTokenizer: Eliminates tokens containing numbers. |
class | Splitter: Partitions a set of tokens into multiple pieces. |
class | StripMarkupTokenizer: Eliminates tokens nested inside markup language tags; assumes that tokens have been split by line rather than by whitespace. |
class | StripPunctuationTokenizer: Separates tokens based on punctuation and removes punctuation from tokens. |
class | TikaFileTokenizer |
class | TokenizerList: An ordered list of edu.georgetown.gucs.tokenizers objects used to split a document into tokens and alter the tokens in various ways. |
Modifier and Type | Method and Description |
---|---|
java.util.List<Tokenizer> | TokenizerList.getTokenizers() |
Modifier and Type | Method and Description |
---|---|
java.util.Map<java.lang.String,java.util.List<Token>> | Splitter.tokenizeEachEntry(java.util.List<Tokenizer> tokenizers): Tokenizes each splitter entry with the list of tokenizers; assumes no FileTokenizers are in the list. |
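The signature of tokenizeEachEntry suggests a per-entry loop: apply the same ordered tokenizer list to every named entry and collect the resulting token lists in a map keyed by entry name. The sketch below illustrates that contract under stated assumptions; the helper names and the use of String tokens (rather than the library's Token type) are stand-ins, not the real Splitter implementation.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hedged sketch of a tokenizeEachEntry-style method: same tokenizer list,
// applied to each entry, results keyed by entry name. Illustrative only.
public class SplitterSketch {

    // Simplified stand-in for a filtering tokenizer.
    interface TokenStage {
        List<String> apply(List<String> tokens);
    }

    static Map<String, List<String>> tokenizeEachEntry(
            Map<String, String> entries, List<TokenStage> stages) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : entries.entrySet()) {
            // Whitespace split stands in for the file-tokenizing step, which
            // is assumed to have happened already (no FileTokenizers allowed
            // in the list, per the documented precondition).
            List<String> tokens = Arrays.asList(e.getValue().trim().split("\\s+"));
            for (TokenStage stage : stages) {
                tokens = stage.apply(tokens);
            }
            out.put(e.getKey(), tokens);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> entries = Map.of("doc1", "alpha beta42 gamma");
        TokenStage dropNumbers = tokens -> tokens.stream()
                .filter(t -> !t.matches(".*\\d.*"))
                .collect(Collectors.toList());
        System.out.println(tokenizeEachEntry(entries, List.of(dropNumbers)));
        // {doc1=[alpha, gamma]}
    }
}
```

Excluding FileTokenizers from the list makes sense under this reading: entries are already in-memory token sources, so a stage that opens and reads files has nothing to operate on.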