edu.georgetown.gucs.tokenizers
Class OutsideInTokenizer

java.lang.Object
  extended by edu.georgetown.gucs.tokenizers.Tokenizer
      extended by edu.georgetown.gucs.tokenizers.OutsideInTokenizer

public class OutsideInTokenizer
extends Tokenizer

Uses Oracle's OutsideIn technology to extract text from a file, then splits the contents of the file into tokens, either on whitespace or by line. Used for Dictionary creation.

Author:
Lindsay Neubauer

Field Summary
 
Fields inherited from class edu.georgetown.gucs.tokenizers.Tokenizer
constructor, tokenVector
 
Constructor Summary
OutsideInTokenizer()
          Constructor that sets the token creation mode to split based on whitespace.
OutsideInTokenizer(java.lang.String mode)
          Constructor that sets the token creation mode.
 
Method Summary
static void main(java.lang.String[] args)
           
 void tokenize(java.lang.String filename)
          Splits the document into tokens.
 
Methods inherited from class edu.georgetown.gucs.tokenizers.Tokenizer
getConstructor, iterator, printTokens, tokenize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

OutsideInTokenizer

public OutsideInTokenizer()
Constructor that sets the token creation mode to split based on whitespace.


OutsideInTokenizer

public OutsideInTokenizer(java.lang.String mode)
Constructor that sets the token creation mode. If set to "lines", the file is split by line; if set to "tokens", the file is split based on whitespace.

Parameters:
mode - the token creation mode: "tokens" to split based on whitespace, or "lines" to split by line
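
The difference between the two mode strings can be pictured with a minimal sketch. It does not use OutsideInTokenizer or OutsideIn at all: it only splits text that has already been extracted. The class SplitModeSketch and its split method are hypothetical names, and both the rejection of unknown modes and the dropping of empty pieces are assumptions, since this page does not say how the real class handles either case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of edu.georgetown.gucs.tokenizers.
public class SplitModeSketch {

    // Splits already-extracted text according to the mode string.
    public static List<String> split(String text, String mode) {
        String pattern;
        if ("lines".equals(mode)) {
            pattern = "\\R";   // any line break sequence (Java 8+)
        } else if ("tokens".equals(mode)) {
            pattern = "\\s+";  // one or more whitespace characters
        } else {
            // Assumption: the real class may treat unknown modes differently.
            throw new IllegalArgumentException("unknown mode: " + mode);
        }
        List<String> result = new ArrayList<>();
        for (String piece : text.split(pattern)) {
            // Assumption: empty pieces (blank lines, leading whitespace) are dropped.
            if (!piece.isEmpty()) {
                result.add(piece);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "first line here\nsecond line";
        System.out.println(split(text, "tokens")); // [first, line, here, second, line]
        System.out.println(split(text, "lines"));  // [first line here, second line]
    }
}
```

With the real class, the corresponding choice would be made once, at construction time, by passing "tokens" or "lines" to the constructor.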

Method Detail

tokenize

public void tokenize(java.lang.String filename)
Description copied from class: Tokenizer
Splits the document into tokens. By default, this method calls the ExceptionHandler to throw a "Tokenizer does not support reading from a file" exception. It must be overridden in each tokenizer that supports reading from a file.

Overrides:
tokenize in class Tokenizer
Parameters:
filename - the filename of the document to split into tokens
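
The override relationship described above can be sketched in isolation. Everything in this example is a hypothetical stand-in: BaseTokenizer, PlainTextTokenizer and getTokens are invented for illustration, the base method throws UnsupportedOperationException instead of going through the ExceptionHandler, and the subclass reads plain UTF-8 text where the real class would extract text through OutsideIn.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of edu.georgetown.gucs.tokenizers.
public class OverrideSketch {

    // Stand-in for the base class: reading from a file is unsupported by default.
    public static class BaseTokenizer {
        protected final List<String> tokens = new ArrayList<>();

        public void tokenize(String filename) {
            // Assumption: the real base class reports this through its ExceptionHandler.
            throw new UnsupportedOperationException(
                    "Tokenizer does not support reading from a file");
        }

        public List<String> getTokens() {
            return tokens;
        }
    }

    // Stand-in for a tokenizer that supports files, so it overrides tokenize.
    public static class PlainTextTokenizer extends BaseTokenizer {
        @Override
        public void tokenize(String filename) {
            try {
                byte[] bytes = Files.readAllBytes(Paths.get(filename));
                String text = new String(bytes, StandardCharsets.UTF_8);
                for (String piece : text.split("\\s+")) {
                    if (!piece.isEmpty()) {
                        tokens.add(piece);
                    }
                }
            } catch (IOException e) {
                // tokenize declares no checked exceptions, so wrap the failure.
                throw new UncheckedIOException(e);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("override-sketch", ".txt");
        Files.write(tmp, "alpha beta\ngamma".getBytes(StandardCharsets.UTF_8));

        PlainTextTokenizer tokenizer = new PlainTextTokenizer();
        tokenizer.tokenize(tmp.toString());
        System.out.println(tokenizer.getTokens()); // [alpha, beta, gamma]

        Files.delete(tmp);
    }
}
```

Calling tokenize on the base stand-in fails immediately, which mirrors why each file-capable tokenizer has to supply its own implementation.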

main

public static void main(java.lang.String[] args)