Class | Description |
---|---|
ArabicFileTokenizer | Splits the contents of an Arabic text file into tokens using Apache Lucene's ArabicAnalyzer. |
ChineseFileTokenizer | Splits the contents of a Chinese or Chinese-English text file into tokens using Apache Lucene's ChineseAnalyzer or SmartChineseAnalyzer. |
FileManglerTokenizer | Manipulates files as they are read. |
FileTokenizer | Splits the contents of a text file into tokens based on whitespace or by line. |
GzippedFileTokenizer | Splits the contents of a gzipped text file into tokens based on whitespace or by line. |
MaximumLengthTokenizer | Eliminates tokens that are longer than a given length. |
MemoryTokenizer | |
MinimumLengthTokenizer | Eliminates tokens that are shorter than a given length. |
OutsideInFileTokenizer | Uses Oracle's Outside In technology to extract text from a file. |
ParseTokenizers | |
PorterTokenizer | Uses English stemming to reduce tokens to their root form. |
RemoveNumericTokensTokenizer | Eliminates tokens that are numbers. |
RemoveTokensWithNumbersTokenizer | Eliminates tokens that contain numbers. |
Splitter | Partitions a set of tokens into multiple pieces. |
StripMarkupTokenizer | Eliminates tokens nested inside markup-language tags; assumes that tokens have been split by line rather than by whitespace. |
StripPunctuationTokenizer | Separates tokens at punctuation and removes punctuation from tokens. |
TikaFileTokenizer | |
Tokenizer | |
TokenizerList | An ordered list of edu.georgetown.gucs.tokenizers objects that splits a document into tokens and alters the tokens in various ways. |
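Several of these classes compose into a pipeline: one tokenizer splits the raw text, and later tokenizers filter or transform the resulting token stream. The sketch below illustrates that composition in plain Java; the class and method names here are hypothetical stand-ins, not the actual edu.georgetown.gucs.tokenizers API, and only mimic the documented behavior of FileTokenizer (whitespace splitting), RemoveTokensWithNumbersTokenizer, and MinimumLengthTokenizer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Conceptual sketch only: these names do not exist in the
// edu.georgetown.gucs.tokenizers package. Each static method stands in
// for one stage of a tokenizer pipeline as described in the table above.
public class PipelineSketch {

    // Whitespace splitting, as FileTokenizer does in whitespace mode.
    static List<String> split(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Drop tokens containing digits (RemoveTokensWithNumbersTokenizer behavior).
    static List<String> dropTokensWithNumbers(List<String> tokens) {
        return tokens.stream()
                     .filter(t -> !t.matches(".*\\d.*"))
                     .collect(Collectors.toList());
    }

    // Drop tokens shorter than min characters (MinimumLengthTokenizer behavior).
    static List<String> minLength(List<String> tokens, int min) {
        return tokens.stream()
                     .filter(t -> t.length() >= min)
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = split("the quick brown fox2 ran 42 times");
        tokens = dropTokensWithNumbers(tokens); // removes "fox2", "42"
        tokens = minLength(tokens, 4);          // removes "the", "ran"
        System.out.println(tokens);             // prints [quick, brown, times]
    }
}
```

In the real package, TokenizerList presumably plays the role of the `main` method here, applying each tokenizer in its ordered list to the output of the previous one.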