public class ChineseFileTokenizer extends FileTokenizer
mode
positions, tokenVector
Constructor and Description |
---|
ChineseFileTokenizer() - Constructor that sets the token creation mode to smart tokenization using probabilistic word segmentation. |
ChineseFileTokenizer(java.lang.String mode) - Constructor that sets the token creation mode. |
Modifier and Type | Method and Description |
---|---|
static void | main(java.lang.String[] args) - Tokenizes a Chinese text file twice, once into single characters and once using a probabilistic model, and prints both sets of resulting tokens to the screen. |
void | setMode(java.lang.String mode) - Sets the token creation mode. |
void | tokenize(java.lang.String filename) - Splits a Chinese or mixed Chinese-English document into tokens based on the token creation mode. |
checkIndexing, getMode, readFile
getPositionsVector, getTokenVector, iterator, position_iterator, printTokens, tokenize, tokenize
public ChineseFileTokenizer()
See Also: SmartChineseAnalyzer
public ChineseFileTokenizer(java.lang.String mode)
mode
- the mode string that selects how tokens are split: one Chinese character per token, or smart tokenization using probabilistic word segmentation
public void setMode(java.lang.String mode)
setMode in class FileTokenizer
mode
- the mode string that selects how tokens are split: one Chinese character per token, or smart tokenization using probabilistic word segmentation
public void tokenize(java.lang.String filename)
tokenize in class FileTokenizer
filename
- the string filename of the document to split into tokens
See Also: ChineseAnalyzer, SmartChineseAnalyzer
public static void main(java.lang.String[] args)
args
- array of string command-line arguments
args[0]
- the filename of the Chinese-language document to split into tokens