edu.georgetown.gucs.tokenizers
Class ChineseTokenizer
java.lang.Object
edu.georgetown.gucs.tokenizers.Tokenizer
edu.georgetown.gucs.tokenizers.ChineseTokenizer
public class ChineseTokenizer
- extends Tokenizer
Splits the contents of a Chinese or mixed Chinese-English text file into tokens using Apache Lucene's ChineseAnalyzer
or SmartChineseAnalyzer. The resulting tokens are used for Dictionary creation.
- Author:
- Lindsay Neubauer
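The "individual" mode described below can be illustrated with a plain-Java sketch in which each Han character becomes one token and runs of Latin letters stay together; this is an illustrative approximation only, not the Lucene analyzer the class actually delegates to, and the class and method names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of "individual"-mode tokenization: one Han character
// per token, Latin letter/digit runs kept as single tokens. The real
// ChineseTokenizer uses Lucene's ChineseAnalyzer instead.
public class IndividualModeSketch {

    public static List<String> tokenizeIndividually(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder latinRun = new StringBuilder(); // accumulates English letters/digits
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                flush(latinRun, tokens);
                tokens.add(new String(Character.toChars(cp))); // one character = one token
            } else if (Character.isLetterOrDigit(cp)) {
                latinRun.appendCodePoint(cp);
            } else {
                flush(latinRun, tokens); // whitespace/punctuation ends the current token
            }
            i += Character.charCount(cp);
        }
        flush(latinRun, tokens);
        return tokens;
    }

    private static void flush(StringBuilder run, List<String> tokens) {
        if (run.length() > 0) {
            tokens.add(run.toString());
            run.setLength(0);
        }
    }

    public static void main(String[] args) {
        // prints [我, 爱, Lucene, 分, 词]
        System.out.println(tokenizeIndividually("我爱Lucene分词"));
    }
}
```

"Smart" mode would instead group characters such as 分词 into one word using probabilistic segmentation, which is why it is the default.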
Constructor Summary
ChineseTokenizer()
          Constructor that sets the token creation mode to smart tokenization using probabilistic word segmentation.
ChineseTokenizer(java.lang.String mode)
          Constructor that sets the token creation mode.

Method Summary
static void
main(java.lang.String[] args)
void
setMode(java.lang.String mode)
          Sets the token creation mode.
void
tokenize(java.lang.String filename)
          Splits a Chinese or mixed Chinese-English document into tokens based on the token creation mode.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
ChineseTokenizer
public ChineseTokenizer()
- Constructor that sets the token creation mode to smart tokenization using probabilistic word
segmentation.
ChineseTokenizer
public ChineseTokenizer(java.lang.String mode)
- Constructor that sets the token creation mode. If set to "individual", each Chinese
character is treated as one Chinese word; if set to "smart", the file is split by smart tokenization
using probabilistic word segmentation.
- Parameters:
mode
- "individual" to treat each Chinese character as one word, or "smart" for smart tokenization
using probabilistic word segmentation
setMode
public void setMode(java.lang.String mode)
- Sets the token creation mode. If set to "individual", each Chinese character is treated as one Chinese
word; if set to "smart", the file is split by smart tokenization using probabilistic word segmentation.
- Parameters:
mode
- "individual" to treat each Chinese character as one word, or "smart" for smart tokenization
using probabilistic word segmentation
tokenize
public void tokenize(java.lang.String filename)
- Splits a Chinese or mixed Chinese-English document into tokens based on the token creation mode. If the mode is
"individual", each Chinese character is treated as one Chinese word; if it is "smart", the file is split
by smart tokenization using probabilistic word segmentation. Assumes the file is UTF-8 encoded and uses
Version.LUCENE_30 for "smart" tokenization.
- Overrides:
tokenize
in class Tokenizer
- Parameters:
filename
- the filename of the document to split into tokens
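A minimal stand-in for the file-handling side of tokenize can be sketched in plain Java: read the document as UTF-8 and emit one token per Han character (the "individual" behavior, with non-Chinese text dropped for brevity). The class name and the simplified logic here are assumptions; the actual method delegates to Lucene analyzers:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for ChineseTokenizer.tokenize in "individual" mode:
// read the file as UTF-8 (as the real method assumes) and emit one token
// per Han character, ignoring everything else for brevity.
public class TokenizeFileSketch {

    public static List<String> tokenize(String filename) throws IOException {
        String text = Files.readString(Path.of(filename), StandardCharsets.UTF_8);
        List<String> tokens = new ArrayList<>();
        text.codePoints()
            .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
            .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // Write a small mixed Chinese-English document and tokenize it.
        Path tmp = Files.createTempFile("doc", ".txt");
        Files.writeString(tmp, "中文 tokens", StandardCharsets.UTF_8);
        System.out.println(tokenize(tmp.toString())); // prints [中, 文]
    }
}
```

In "smart" mode the real method would instead feed the file through SmartChineseAnalyzer with Version.LUCENE_30, producing multi-character word tokens rather than single characters.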
main
public static void main(java.lang.String[] args)