public class ChineseFileTokenizer extends FileTokenizer
Inherited fields: mode_, tokenVectorMap
Constructor | Description |
---|---|
ChineseFileTokenizer() | Constructor that sets the token creation mode to smart tokenization, which uses probabilistic word segmentation. |
ChineseFileTokenizer(java.lang.String mode) | Constructor that sets the token creation mode. |
Modifier and Type | Method | Description |
---|---|---|
void | setMode(java.lang.String mode) | Sets the token creation mode. |
java.util.List<Token> | tokenize(java.lang.String filename) | Splits a Chinese or mixed Chinese-English document into tokens based on the token creation mode. |
Inherited methods: addTokenizers, readFile, tokenizeFile, getTokenVectorMap, iterator, printTokens, tokenize, toString
public ChineseFileTokenizer()
See Also: SmartChineseAnalyzer
public ChineseFileTokenizer(java.lang.String mode)
mode - the token creation mode: either one token per Chinese character, or smart tokenization using probabilistic word segmentation
public void setMode(java.lang.String mode)
setMode in class FileTokenizer
mode - the token creation mode: either one token per Chinese character, or smart tokenization using probabilistic word segmentation
public java.util.List<Token> tokenize(java.lang.String filename)
tokenize in class FileTokenizer
filename - the filename of the document to split into tokens
See Also: ChineseAnalyzer, SmartChineseAnalyzer