edu.georgetown.gucs.tokenizers
Class ArabicTokenizer

java.lang.Object
  extended by edu.georgetown.gucs.tokenizers.Tokenizer
      extended by edu.georgetown.gucs.tokenizers.ArabicTokenizer

public class ArabicTokenizer
extends Tokenizer

Splits the contents of an Arabic text file into tokens using Apache Lucene's ArabicAnalyzer. The resulting tokens are used for Dictionary creation.

Author:
Lindsay Neubauer

Field Summary
 
Fields inherited from class edu.georgetown.gucs.tokenizers.Tokenizer
constructor, tokenVector
 
Constructor Summary
ArabicTokenizer()
           
 
Method Summary
static void main(java.lang.String[] args)
           
 void tokenize(java.lang.String filename)
          Splits the document into tokens.
 
Methods inherited from class edu.georgetown.gucs.tokenizers.Tokenizer
getConstructor, iterator, printTokens, tokenize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ArabicTokenizer

public ArabicTokenizer()

Method Detail

tokenize

public void tokenize(java.lang.String filename)
Splits the document into tokens. Assumes the file is UTF-8 encoded and uses Version.LUCENE_30 for the ArabicAnalyzer.

Overrides:
tokenize in class Tokenizer
Parameters:
filename - the name of the file containing the document to split into tokens
See Also:
ArabicAnalyzer
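
Example (illustrative sketch, not this class's implementation):
The following is a minimal sketch of the contract described above: read a file as UTF-8 and collect its tokens in order. It uses only the Java standard library. A Unicode-aware regular-expression split stands in for Lucene's ArabicAnalyzer, so it performs none of the stop-word removal, normalization, or stemming that the analyzer applies. The class name Utf8TokenizeSketch and its methods are hypothetical and are not part of this package.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; not part of edu.georgetown.gucs.tokenizers.
public class Utf8TokenizeSketch {

    // Splits text on any run of characters that are not letters, combining
    // marks, or digits. Keeping marks (\p{M}) means Arabic diacritics stay
    // attached to their word instead of breaking it apart.
    static List<String> tokenizeText(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{M}\\p{N}]+")) {
            if (!piece.isEmpty()) { // split() can yield a leading empty string
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Reads the whole file, decoding it explicitly as UTF-8 (the same
    // encoding assumption that tokenize(String) documents), then tokenizes it.
    static List<String> tokenizeFile(String filename) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(filename));
        return tokenizeText(new String(bytes, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Utf8TokenizeSketch <utf8-text-file>");
            return;
        }
        for (String token : tokenizeFile(args[0])) {
            System.out.println(token);
        }
    }
}
```

Decoding with an explicit charset, rather than the platform default, is what makes the UTF-8 assumption hold on every machine; a file in another encoding would produce garbled tokens rather than an error.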

main

public static void main(java.lang.String[] args)