edu.georgetown.gucs.tokenizers
Class OutsideInTokenizer
java.lang.Object
edu.georgetown.gucs.tokenizers.Tokenizer
edu.georgetown.gucs.tokenizers.OutsideInTokenizer
public class OutsideInTokenizer
- extends Tokenizer
Uses Oracle's Outside In technology to extract text from a file, then splits the file's contents
into tokens, either on whitespace or by line. Used for Dictionary creation.
- Author:
- Lindsay Neubauer
Constructor Summary
OutsideInTokenizer()
- Constructor that sets the token creation mode to split based on whitespace.
OutsideInTokenizer(java.lang.String mode)
- Constructor that sets the token creation mode.
Method Summary
static void main(java.lang.String[] args)
void tokenize(java.lang.String filename)
- Splits the document into tokens.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
OutsideInTokenizer
public OutsideInTokenizer()
- Constructor that sets the token creation mode to split based on whitespace
OutsideInTokenizer
public OutsideInTokenizer(java.lang.String mode)
- Constructor that sets the token creation mode. If mode is "lines", the file is split by line;
if mode is "tokens", the file is split based on whitespace.
- Parameters:
mode - the token creation mode: "lines" to split by line, "tokens" to split on whitespace
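The two mode strings can be illustrated with a minimal, self-contained sketch. It is not the implementation of OutsideInTokenizer and does not use Outside In: it works on text that has already been extracted, and the class name SplitModeSketch and the method split are hypothetical. Dropping empty pieces and rejecting unknown mode values are assumptions, since the documentation above does not describe either behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of edu.georgetown.gucs.tokenizers.
class SplitModeSketch {

    // Splits already-extracted text according to the mode string:
    // "lines" splits on line breaks, "tokens" splits on runs of whitespace.
    static List<String> split(String text, String mode) {
        String pattern;
        if ("lines".equals(mode)) {
            pattern = "\\R";   // any line terminator (Java 8+)
        } else if ("tokens".equals(mode)) {
            pattern = "\\s+";  // one or more whitespace characters
        } else {
            // Assumption: the documented class may treat other values differently.
            throw new IllegalArgumentException("Unknown mode: " + mode);
        }
        List<String> pieces = new ArrayList<>();
        for (String piece : text.split(pattern)) {
            if (!piece.isEmpty()) {   // assumption: empty pieces are dropped
                pieces.add(piece);
            }
        }
        return pieces;
    }

    public static void main(String[] args) {
        String text = "alpha beta\ngamma  delta";
        System.out.println(split(text, "tokens")); // [alpha, beta, gamma, delta]
        System.out.println(split(text, "lines"));  // [alpha beta, gamma  delta]
    }
}
```

With the documented class, the equivalent choice would presumably be made once at construction time, for example new OutsideInTokenizer("lines"), and then applied by tokenize.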
tokenize
public void tokenize(java.lang.String filename)
- Description copied from class: Tokenizer
- Splits the document into tokens. By default, this method calls the ExceptionHandler to throw
a "Tokenizer does not support reading from a file" exception. It must be overridden in each
tokenizer that supports reading from a file.
- Overrides:
tokenize in class Tokenizer
- Parameters:
filename - the filename of the document to split into tokens
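The override contract described above (a base method that fails by default, and a subclass that replaces it to read from a file) can be sketched as follows. Every class name here is a hypothetical stand-in: SketchTokenizer is not the real Tokenizer, UnsupportedOperationException stands in for the ExceptionHandler call, and the subclass reads plain UTF-8 text rather than extracting it with Outside In.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo driver; writes a temporary file and tokenizes it.
class OverrideSketchDemo {
    static List<String> runDemo() {
        try {
            Path file = Files.createTempFile("sketch", ".txt");
            try {
                Files.write(file, "one two\nthree".getBytes(StandardCharsets.UTF_8));
                WhitespaceSketchTokenizer tokenizer = new WhitespaceSketchTokenizer();
                tokenizer.tokenize(file.toString());
                return tokenizer.tokens;
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // [one, two, three]
        try {
            new SketchTokenizer().tokenize("any.txt");
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
    }
}

// Stand-in for the base class: file input is unsupported unless overridden.
class SketchTokenizer {
    final List<String> tokens = new ArrayList<>();

    void tokenize(String filename) {
        // Assumption: the real base class reports this through its ExceptionHandler.
        throw new UnsupportedOperationException("Tokenizer does not support reading from a file");
    }
}

// Stand-in for a file-capable tokenizer: overrides tokenize and splits on whitespace.
class WhitespaceSketchTokenizer extends SketchTokenizer {
    @Override
    void tokenize(String filename) {
        try {
            byte[] bytes = Files.readAllBytes(Paths.get(filename));
            String text = new String(bytes, StandardCharsets.UTF_8);
            for (String piece : text.split("\\s+")) {
                if (!piece.isEmpty()) {
                    tokens.add(piece);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping the failing default in the base class means a tokenizer that cannot read files fails loudly at the call site instead of silently producing no tokens.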
main
public static void main(java.lang.String[] args)