public class StripMarkupTokenizer extends Tokenizer
Fields inherited from class Tokenizer: positions, tokenVector
Constructor and Description |
---|
StripMarkupTokenizer() |
StripMarkupTokenizer(java.lang.String keepScript): Constructor that specifies whether to keep tokens nested inside <script> tags; the default is to eliminate these tokens |
Modifier and Type | Method and Description |
---|---|
void | tokenize(java.util.Iterator<java.lang.String> iterator): Eliminates tokens nested inside markup language tags; assumes that tokens have been split by line rather than by whitespace |
void | tokenize(java.util.Iterator<java.lang.String> tokensIterator, java.util.Iterator<Pair<java.lang.Integer,java.lang.Integer>> positionsIterator): Eliminates tokens nested inside markup language tags; assumes that tokens have been split by line rather than by whitespace |
Methods inherited from class Tokenizer: getPositionsVector, getTokenVector, iterator, position_iterator, printTokens, tokenize
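The tag-elimination behavior summarized above can be illustrated with a small standalone sketch. This is not the library's implementation: the class `MarkupStripSketch` and its `strip` method are hypothetical names, and the sketch assumes that "tokens nested inside markup language tags" means the text between `<` and `>`, that each tag opens and closes on the same line, and that the `keepScript` option controls whether text between an opening and a closing script tag survives. A `<` that appears inside script code would be misread as the start of a tag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; not the library's StripMarkupTokenizer.
class MarkupStripSketch {

    /**
     * Removes markup tags from each line and, unless keepScript is true,
     * also drops the text found between <script ...> and </script>.
     * Lines that end up empty are skipped.
     */
    static List<String> strip(Iterator<String> lines, boolean keepScript) {
        List<String> kept = new ArrayList<>();
        boolean inScript = false; // true while between <script> and </script>
        while (lines.hasNext()) {
            String line = lines.next();
            StringBuilder text = new StringBuilder();
            int i = 0;
            while (i < line.length()) {
                char c = line.charAt(i);
                if (c == '<') {
                    int close = line.indexOf('>', i);
                    if (close < 0) {
                        break; // unterminated tag: discard the rest of the line
                    }
                    String tag = line.substring(i + 1, close).trim().toLowerCase();
                    if (tag.startsWith("/script")) {
                        inScript = false;
                    } else if (tag.startsWith("script")) {
                        inScript = true;
                    }
                    i = close + 1; // skip everything inside the tag
                } else {
                    if (!inScript || keepScript) {
                        text.append(c);
                    }
                    i++;
                }
            }
            String cleaned = text.toString().trim();
            if (!cleaned.isEmpty()) {
                kept.add(cleaned);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "<html><body>",
                "<p>Hello world</p>",
                "<script>var x = 1;</script>",
                "<b>Bye</b>");
        System.out.println(strip(lines.iterator(), false));
        System.out.println(strip(lines.iterator(), true));
    }
}
```

With the sample input in `main`, the first call drops the script body and the second keeps it, which mirrors the documented default of eliminating tokens inside script tags.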
public StripMarkupTokenizer()
public StripMarkupTokenizer(java.lang.String keepScript)
Constructor that specifies whether to keep tokens nested inside <script> tags; the default is to eliminate these tokens.
Parameters:
keepScript - the string value (true or false) specifying whether to keep tokens nested inside <script> tags
public void tokenize(java.util.Iterator<java.lang.String> iterator)
public void tokenize(java.util.Iterator<java.lang.String> tokensIterator, java.util.Iterator<Pair<java.lang.Integer,java.lang.Integer>> positionsIterator)
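The two-iterator overload takes a token iterator and a parallel iterator of position pairs. The page does not say what the pairs encode or how they are consumed, so the following is only a sketch under assumptions: it walks both iterators in lockstep and keeps a position only when its token still has text after tag removal. `LockstepStripSketch` is a hypothetical name, `Map.Entry<Integer, Integer>` stands in for the library's `Pair` type, and a simple regular expression replaces whatever tag detection the library actually uses.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; Map.Entry stands in for the library's Pair.
class LockstepStripSketch {

    /**
     * Consumes tokens and positions together. A token's position is kept
     * only if the token has non-blank text left once tags are removed.
     */
    static List<Map.Entry<String, Map.Entry<Integer, Integer>>> strip(
            Iterator<String> tokens,
            Iterator<Map.Entry<Integer, Integer>> positions) {
        List<Map.Entry<String, Map.Entry<Integer, Integer>>> kept = new ArrayList<>();
        while (tokens.hasNext() && positions.hasNext()) {
            String token = tokens.next();
            Map.Entry<Integer, Integer> position = positions.next();
            // Remove anything of the form <...>; assumes tags do not span tokens.
            String text = token.replaceAll("<[^>]*>", "").trim();
            if (!text.isEmpty()) {
                kept.add(new SimpleEntry<>(text, position));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("<p>", "Hello", "</p>", "<b>Bye</b>");
        // Assumed meaning for this demo: (start, end) character offsets.
        List<Map.Entry<Integer, Integer>> positions = new ArrayList<>();
        positions.add(new SimpleEntry<>(0, 3));
        positions.add(new SimpleEntry<>(3, 8));
        positions.add(new SimpleEntry<>(8, 12));
        positions.add(new SimpleEntry<>(12, 22));
        System.out.println(strip(tokens.iterator(), positions.iterator()));
    }
}
```

Advancing both iterators on every step, even for dropped tokens, is what keeps each surviving token paired with its own original position.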