public class FileTokenizer extends Tokenizer
Modifier and Type | Field and Description |
---|---|
protected java.lang.String | mode: The string specifying whether to split tokens based on whitespace ("tokens") or by line ("lines") |

Fields inherited from class Tokenizer: positions, tokenVector
Constructor and Description |
---|
FileTokenizer(): Constructor that sets the token creation mode to split based on whitespace. |
FileTokenizer(java.lang.String mode): Constructor that sets the token creation mode. |
Modifier and Type | Method and Description |
---|---|
void | checkIndexing(java.lang.String filename, int start_index, int end_index): Prints to the screen a portion of the given text document. |
java.lang.String | getMode(): Provides the token creation mode. |
static void | main(java.lang.String[] args): Tokenizes a text file and prints the resulting tokens to the screen. |
protected void | readFile(java.lang.String filename): Splits a text document into tokens. |
void | setMode(java.lang.String mode): Sets the token creation mode. |
void | tokenize(java.lang.String filename): Splits the document into tokens. |

Methods inherited from class Tokenizer: getPositionsVector, getTokenVector, iterator, position_iterator, printTokens, tokenize, tokenize
protected java.lang.String mode

public FileTokenizer()

public FileTokenizer(java.lang.String mode)
mode - the string specifying to split tokens based on whitespace or by line

public void setMode(java.lang.String mode)
mode - the string specifying to split tokens based on whitespace or by line

public java.lang.String getMode()

public void tokenize(java.lang.String filename)
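
The difference between the two mode strings, "tokens" and "lines", can be illustrated with a small stand-alone sketch. This is not the FileTokenizer source: the class name ModeSplitSketch and the method split are hypothetical, the sketch works on an in-memory string rather than a file, and it assumes that any mode other than "lines" falls back to whitespace splitting.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-alone sketch; not the FileTokenizer implementation. */
public class ModeSplitSketch {

    /** Splits text by line when mode is "lines", otherwise on runs of whitespace. */
    public static List<String> split(String text, String mode) {
        // Assumption: "lines" breaks on any line terminator (\R); every other
        // mode value is treated like "tokens" and breaks on whitespace.
        String regex = "lines".equals(mode) ? "\\R" : "\\s+";
        List<String> out = new ArrayList<>();
        for (String piece : text.split(regex)) {
            if (!piece.isEmpty()) { // drop empties from leading separators or blank lines
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "to be or\nnot to be";
        System.out.println(split(text, "tokens")); // [to, be, or, not, to, be]
        System.out.println(split(text, "lines"));  // [to be or, not to be]
    }
}
```

With the documented API, the equivalent choice would presumably be made by passing "tokens" or "lines" to the FileTokenizer(java.lang.String mode) constructor or to setMode before calling tokenize.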
protected void readFile(java.lang.String filename)
filename - the string filename of the document to split into tokens

public void checkIndexing(java.lang.String filename, int start_index, int end_index)
filename - the string filename of the document to split into tokens
start_index - the number of characters to skip in file before printing, inclusive
end_index - the number of characters to print from the file, exclusive

public static void main(java.lang.String[] args)
args - array of string command line arguments
args[0] - the filename of the English language document to split into tokens
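
The start_index and end_index parameters of checkIndexing read like a half-open character range. The sketch below shows that interpretation on an in-memory string. It is an assumption, not the documented behavior: the parameter descriptions could also mean "skip start_index characters, then print end_index characters". The class name SliceSketch and the method slice are hypothetical.

```java
/** Hypothetical stand-alone sketch; not the FileTokenizer implementation. */
public class SliceSketch {

    /**
     * Returns the characters at offsets [startIndex, endIndex).
     * Assumption: both arguments are character offsets into the text, and
     * out-of-range values are clamped instead of raising an exception.
     */
    public static String slice(String text, int startIndex, int endIndex) {
        int from = Math.max(0, Math.min(startIndex, text.length()));
        int to = Math.max(from, Math.min(endIndex, text.length()));
        return text.substring(from, to);
    }

    public static void main(String[] args) {
        String text = "to be or not to be";
        System.out.println(slice(text, 3, 5));  // be
        System.out.println(slice(text, 9, 12)); // not
    }
}
```

A check of this kind is useful for confirming that stored token positions point back at the expected characters in the source document.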