public class GzippedFileTokenizer extends FileTokenizer
Inherited fields: mode, positions, tokenVector
Constructor and Description |
---|
GzippedFileTokenizer(): Constructor that sets the token creation mode to split based on whitespace. |
GzippedFileTokenizer(java.lang.String mode): Constructor that sets the token creation mode. |
Modifier and Type | Method and Description |
---|---|
void | checkIndexing(java.lang.String filename, int start_index, int end_index): Prints a portion of the given gzipped text document to the screen. |
static void | main(java.lang.String[] args): Tokenizes a gzipped text file and prints the resulting tokens to the screen. |
protected void | readFile(java.lang.String filename): Splits a gzipped text document into tokens. |
void | tokenize(java.lang.String filename): Splits a gzipped text document into tokens. |
Inherited methods: getMode, setMode, getPositionsVector, getTokenVector, iterator, position_iterator, printTokens, tokenize, tokenize
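The summary above says that tokenize and readFile split a gzipped text document into tokens, either on whitespace or by line. The following is a minimal standalone sketch of that idea using only the Java standard library. It is not the implementation of GzippedFileTokenizer: the class name GzipTokenizeSketch, the Mode enum (a stand-in for the string mode, whose accepted values are not listed on this page), the UTF-8 encoding, and the gzip helper are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipTokenizeSketch {
    /** Hypothetical stand-in for the string mode; the real accepted values are not documented here. */
    public enum Mode { WHITESPACE, LINE }

    /** Decompresses gzipped text (assumed UTF-8) from the stream and splits it into tokens. */
    public static List<String> tokenize(InputStream gzipped, Mode mode) {
        List<String> tokens = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (mode == Mode.LINE) {
                    // Line mode: each line of the document is one token.
                    tokens.add(line);
                } else {
                    // Whitespace mode: split on runs of whitespace, dropping empty pieces.
                    for (String piece : line.trim().split("\\s+")) {
                        if (!piece.isEmpty()) {
                            tokens.add(piece);
                        }
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    /** Demo helper (an assumption): gzips a string in memory so the example needs no file. */
    public static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = gzip("the quick  brown\nfox jumps\n");
        System.out.println(tokenize(new ByteArrayInputStream(data), Mode.WHITESPACE));
        System.out.println(tokenize(new ByteArrayInputStream(data), Mode.LINE));
    }
}
```

To read from a file on disk instead of an in-memory buffer, a java.io.FileInputStream opened on the filename could be passed as the first argument.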
public GzippedFileTokenizer()
public GzippedFileTokenizer(java.lang.String mode)
mode - the token creation mode: split tokens on whitespace or by line
public void tokenize(java.lang.String filename)
tokenize
in class FileTokenizer
filename - the string filename of the document to split into tokens
protected void readFile(java.lang.String filename)
readFile
in class FileTokenizer
filename - the string filename of the document to split into tokens
public void checkIndexing(java.lang.String filename, int start_index, int end_index)
checkIndexing
in class FileTokenizer
filename - the string filename of the document to split into tokens
start_index - the number of characters to skip in file before printing, inclusive
end_index - the number of characters to print from the file, exclusive
public static void main(java.lang.String[] args)
args - array of string command line arguments
args[0] - the filename of the English language gzipped document to split into tokens
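The checkIndexing parameters describe printing a portion of a gzipped document delimited by start_index (inclusive) and end_index (exclusive). The sketch below shows one way such a slice could be read with the standard library. It assumes the two values form a half-open character range [start_index, end_index) over the decompressed text and that the text is UTF-8; the page does not confirm either point, and the class GzipSliceSketch with its slice and gzip methods is hypothetical rather than part of GzippedFileTokenizer.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipSliceSketch {
    /**
     * Returns the characters at positions [startIndex, endIndex) of the decompressed text.
     * Positions count Java chars (UTF-16 code units). Assumption: both values are indices.
     */
    public static String slice(InputStream gzipped, int startIndex, int endIndex) {
        if (startIndex < 0 || endIndex < startIndex) {
            throw new IllegalArgumentException("expected 0 <= startIndex <= endIndex");
        }
        StringBuilder out = new StringBuilder();
        try (Reader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            int position = 0;
            int c;
            // Stop at endIndex or at end of stream, whichever comes first.
            while (position < endIndex && (c = reader.read()) != -1) {
                if (position >= startIndex) {
                    out.append((char) c);
                }
                position++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    /** Demo helper (an assumption): gzips a string in memory so the example needs no file. */
    public static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = gzip("hello gzipped world");
        // Skips the first 6 characters and stops before position 13.
        System.out.println(slice(new ByteArrayInputStream(data), 6, 13));
    }
}
```

Streaming character by character avoids loading the whole decompressed document into memory, which matters for large gzipped corpora. If end_index is instead a count of characters to print, the loop bound would become startIndex + endIndex.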