edu.georgetown.gucs.tokenizers
Class ChineseTokenizer

java.lang.Object
  extended by edu.georgetown.gucs.tokenizers.Tokenizer
      extended by edu.georgetown.gucs.tokenizers.ChineseTokenizer

public class ChineseTokenizer
extends Tokenizer

Splits the contents of a Chinese or mixed Chinese-English text file into tokens using Apache Lucene's ChineseAnalyzer or SmartChineseAnalyzer. The resulting tokens are used for Dictionary creation.
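
A minimal usage sketch is shown below; the driver class name, the file path, and the no-argument printTokens() call are illustrative assumptions, not part of this API documentation:

 import edu.georgetown.gucs.tokenizers.ChineseTokenizer;

 public class ChineseTokenizerExample {
     public static void main(String[] args) {
         // Default constructor uses "smart" probabilistic word segmentation
         ChineseTokenizer tokenizer = new ChineseTokenizer();

         // Tokenize a UTF-8 encoded Chinese or mixed Chinese-English file (placeholder path)
         tokenizer.tokenize("chinese_document.txt");

         // Print the resulting tokens; printTokens is inherited from Tokenizer
         tokenizer.printTokens();
     }
 }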

Author:
Lindsay Neubauer

Field Summary
 
Fields inherited from class edu.georgetown.gucs.tokenizers.Tokenizer
constructor, tokenVector
 
Constructor Summary
ChineseTokenizer()
          Constructor that sets the token creation mode to smart tokenization using probabilistic word segmentation.
ChineseTokenizer(java.lang.String mode)
          Constructor that sets the token creation mode.
 
Method Summary
static void main(java.lang.String[] args)
           
 void setMode(java.lang.String mode)
          Sets the token creation mode.
 void tokenize(java.lang.String filename)
          Splits a Chinese or mixed Chinese-English document into tokens based on the token creation mode.
 
Methods inherited from class edu.georgetown.gucs.tokenizers.Tokenizer
getConstructor, iterator, printTokens, tokenize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ChineseTokenizer

public ChineseTokenizer()
Constructor that sets the token creation mode to smart tokenization using probabilistic word segmentation.


ChineseTokenizer

public ChineseTokenizer(java.lang.String mode)
Constructor that sets the token creation mode. If set to "individual", the file is split so that each Chinese character is treated as one word; if set to "smart", the file is split using smart tokenization with probabilistic word segmentation.

Parameters:
mode - the token creation mode: "individual" to treat each Chinese character as one word, or "smart" to use probabilistic word segmentation
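
As an illustrative sketch, the two modes correspond to the two constructor arguments (variable names are placeholders):

 ChineseTokenizer perCharacter = new ChineseTokenizer("individual"); // one Chinese character per token
 ChineseTokenizer segmented    = new ChineseTokenizer("smart");      // probabilistic word segmentation
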
Method Detail

setMode

public void setMode(java.lang.String mode)
Sets the token creation mode. If set to "individual", the file is split so that each Chinese character is treated as one word; if set to "smart", the file is split using smart tokenization with probabilistic word segmentation.

Parameters:
mode - the token creation mode: "individual" to treat each Chinese character as one word, or "smart" to use probabilistic word segmentation
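
As an illustrative sketch, the mode can also be changed on an existing tokenizer before tokenizing (the filename is a placeholder):

 ChineseTokenizer tokenizer = new ChineseTokenizer(); // defaults to "smart"
 tokenizer.setMode("individual");                     // switch to one character per token
 tokenizer.tokenize("document.txt");                  // placeholder filename; file must be UTF-8 encoded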

tokenize

public void tokenize(java.lang.String filename)
Splits a Chinese or mixed Chinese-English document into tokens based on the token creation mode. If the mode is set to "individual", each Chinese character is treated as one word; if set to "smart", the file is split using smart tokenization with probabilistic word segmentation. Assumes the file is UTF-8 encoded and uses Version.LUCENE_30 for "smart" tokenization.

Overrides:
tokenize in class Tokenizer
Parameters:
filename - the string filename of the document to split into tokens
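
A brief sketch of tokenizing files in each mode (the filenames are placeholder paths and must refer to UTF-8 encoded files):

 ChineseTokenizer perCharacter = new ChineseTokenizer("individual");
 perCharacter.tokenize("chinese_only.txt");           // each Chinese character becomes its own token

 ChineseTokenizer segmented = new ChineseTokenizer("smart");
 segmented.tokenize("mixed_chinese_english.txt");     // probabilistic segmentation via Version.LUCENE_30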

main

public static void main(java.lang.String[] args)