public class Tokenizer
extends java.lang.Object
Modifier and Type | Field and Description |
---|---|
protected java.util.Vector<Pair<java.lang.Integer,java.lang.Integer>> | positions: The list of integer pairs corresponding to the position of each token |
protected java.util.Vector<java.lang.String> | tokenVector: The list of each token in order of its appearance |
Constructor and Description |
---|
Tokenizer() |
Modifier and Type | Method and Description |
---|---|
java.util.List<Pair<java.lang.Integer,java.lang.Integer>> | getPositionsVector(): Provides the list of integer pairs corresponding to the position of each token |
java.util.List<java.lang.String> | getTokenVector(): Provides the list of each token in order of its appearance |
java.util.Iterator<java.lang.String> | iterator(): Provides an iterator over the tokens in this tokenizer |
java.util.Iterator<Pair<java.lang.Integer,java.lang.Integer>> | position_iterator(): Provides an iterator over the positions of elements in this tokenizer |
static void | printTokens(java.util.Iterator<java.lang.String> tokens): Prints each token to the system output stream using the UTF-8 charset |
void | tokenize(java.util.Iterator<java.lang.String> iterator): Alters or eliminates certain tokens. |
void | tokenize(java.util.Iterator<java.lang.String> tokensIterator, java.util.Iterator<Pair<java.lang.Integer,java.lang.Integer>> positionsIterator): Alters or eliminates certain tokens when using Splitter. |
void | tokenize(java.lang.String filename): Splits the document into tokens. |
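The token and position accessors expose two sequences that appear to be parallel: one entry per token, and one integer pair per token. A minimal sketch of walking two such iterators in lockstep is given here. The `IntPair` class is a hypothetical stand-in for the library's `Pair<Integer,Integer>` (whose accessors are not shown on this page), and the sample tokens and offsets are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class TokenPositionsSketch {

    /** Hypothetical stand-in for Pair<Integer,Integer>; the real class's accessors are not documented here. */
    static final class IntPair {
        final int first;
        final int second;

        IntPair(int first, int second) {
            this.first = first;
            this.second = second;
        }
    }

    /** Walks both iterators in lockstep and formats each token with its position pair. */
    static List<String> describe(Iterator<String> tokens, Iterator<IntPair> positions) {
        List<String> lines = new ArrayList<>();
        // Stop as soon as either iterator is exhausted, so mismatched lengths do not throw.
        while (tokens.hasNext() && positions.hasNext()) {
            String token = tokens.next();
            IntPair p = positions.next();
            lines.add(token + " [" + p.first + "," + p.second + "]");
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for iterator() and position_iterator().
        List<String> tokens = Arrays.asList("Hello", "world");
        List<IntPair> positions = Arrays.asList(new IntPair(0, 5), new IntPair(6, 11));
        for (String line : describe(tokens.iterator(), positions.iterator())) {
            System.out.println(line);
        }
    }
}
```

With a real `Tokenizer`, the two arguments would presumably come from `iterator()` and `position_iterator()`; whether the second value of each pair is inclusive or exclusive is not stated on this page.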
protected java.util.Vector<java.lang.String> tokenVector
protected java.util.Vector<Pair<java.lang.Integer,java.lang.Integer>> positions
public java.util.List<java.lang.String> getTokenVector()
public java.util.List<Pair<java.lang.Integer,java.lang.Integer>> getPositionsVector()
public void tokenize(java.util.Iterator<java.lang.String> iterator)
iterator - the string iterator over the token elements
public void tokenize(java.util.Iterator<java.lang.String> tokensIterator, java.util.Iterator<Pair<java.lang.Integer,java.lang.Integer>> positionsIterator)
tokensIterator - the string iterator over the token elements
positionsIterator - the integer pair iterator over the start and end position elements
public void tokenize(java.lang.String filename)
filename - the filename of the document to split into tokens
public java.util.Iterator<java.lang.String> iterator()
public java.util.Iterator<Pair<java.lang.Integer,java.lang.Integer>> position_iterator()
public static void printTokens(java.util.Iterator<java.lang.String> tokens)
tokens - the string iterator over the tokens in this tokenizer
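For readers curious how printing tokens "using the UTF-8 charset" can be done with the standard library, one possible approach is sketched here. It is not the source of `printTokens`: the class name, the extra `PrintStream` parameter, the `utf8` helper, and the one-token-per-line layout are all assumptions made for illustration.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.util.Arrays;
import java.util.Iterator;

public class PrintTokensSketch {

    /** Wraps a stream in an auto-flushing PrintStream that encodes text as UTF-8. */
    static PrintStream utf8(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    /** Writes each token on its own line (the line-per-token layout is an assumption). */
    static void printTokens(Iterator<String> tokens, PrintStream out) {
        while (tokens.hasNext()) {
            out.println(tokens.next());
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8(System.out);
        // "caf\u00e9" is "café"; the escape keeps this file independent of source encoding.
        printTokens(Arrays.asList("caf\u00e9", "na\u00efve").iterator(), out);
        out.flush();
    }
}
```

Wrapping `System.out` explicitly, rather than relying on the platform default charset, keeps non-ASCII tokens intact on systems whose default encoding is not UTF-8.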