| Class | Description |
|---|---|
| ArabicFileTokenizer | Splits the contents of an Arabic text file into tokens using Apache Lucene's ArabicAnalyzer. |
| ChineseFileTokenizer | Splits the contents of a Chinese or Chinese-English text file into tokens using Apache Lucene's ChineseAnalyzer or SmartChineseAnalyzer. |
| EnronEmailTokenizer | Eliminates tokens related to attachments in the Enron data set. |
| EnronStripMailHeaderTokenizer | Eliminates tokens from mail headers in the Enron data set. |
| FileManglerTokenizer | Manipulates files as they are read. |
| FileTokenizer | Splits the contents of a text file into tokens by whitespace or by line. |
| GzippedFileTokenizer | Splits the contents of a gzipped text file into tokens by whitespace or by line. |
| MaximumLengthTokenizer | Eliminates tokens longer than a given length. |
| MinimumLengthTokenizer | Eliminates tokens shorter than a given length. |
| OutsideInFileTokenizer | Uses Oracle's Outside In technology to extract text from a file. |
| PorterTokenizer | Applies English (Porter) stemming to reduce tokens to their root form. |
| RemoveNumericTokensTokenizer | Eliminates tokens that are numbers. |
| RemoveTokensWithNumbersTokenizer | Eliminates tokens that contain numbers. |
| StopWordRemoverTokenizer | Eliminates tokens listed in the given stop-words document; assumes the document contains lowercase, non-Porter-stemmed English words. |
| StripMarkupTokenizer | Eliminates tokens nested inside markup-language tags; assumes tokens have been split by line rather than by whitespace. |
| StripPunctuationTokenizer | Splits tokens at punctuation and removes punctuation from tokens. |
| TokenizeFile | Tests tokenizers using a configuration file listing the Tokenizer objects to apply to a document, and prints the results of the tokenization. |
| Tokenizer | Splits a document into tokens (either by line or by word) or alters the tokens in various ways. |
| TokenizerList | An ordered list of Tokenizer objects that splits a document into tokens and alters the tokens in various ways. |
| TokenizerListManager | Ensures that TokenizerList objects are created only once. |
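
The classes above describe a pipeline pattern: a TokenizerList applies its Tokenizer objects in order, with each stage either splitting or filtering the token stream. The sketch below illustrates that idea only; the interface, method names, and stage constructors here are hypothetical stand-ins, not the actual API of these classes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Hypothetical sketch of an ordered tokenizer pipeline. Each stage
// transforms the token list, mirroring how a TokenizerList would
// apply its Tokenizer objects one after another.
public class PipelineSketch {
    // Stage analogous to MinimumLengthTokenizer: drop tokens shorter than n.
    static UnaryOperator<List<String>> minLength(int n) {
        return tokens -> tokens.stream()
                .filter(t -> t.length() >= n)
                .collect(Collectors.toList());
    }

    // Stage analogous to RemoveNumericTokensTokenizer: drop tokens that
    // consist entirely of digits.
    static UnaryOperator<List<String>> removeNumeric() {
        return tokens -> tokens.stream()
                .filter(t -> !t.matches("\\d+"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "2017", "token", "42", "pipeline");
        // Apply the stages in order, as an ordered TokenizerList would.
        for (UnaryOperator<List<String>> stage :
                Arrays.asList(minLength(2), removeNumeric())) {
            tokens = stage.apply(tokens);
        }
        System.out.println(tokens); // [token, pipeline]
    }
}
```

Because the stages run in a fixed order, adding, removing, or reordering a filter changes the output without touching the other stages, which is the main appeal of keeping the Tokenizer objects in an ordered list.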