public class Dictionary
extends java.lang.Object
implements java.io.Serializable
Modifier and Type | Field and Description |
---|---|
protected java.util.Map<java.lang.String,DictionaryEntry> | dictionary |
protected int | nextPosition |
static long | serialVersionUID |
protected int | totalDocuments |
Constructor and Description |
---|
Dictionary(): Constructor that sets default values for this dictionary. |
Dictionary(java.lang.String dictionaryFilename): Constructor that loads this dictionary from a file. |
Modifier and Type | Method and Description |
---|---|
void | addDocument(java.util.Iterator<java.lang.String> documentIterator): Adds a document to this dictionary. |
void | addTokenizers(java.lang.String config_file): Reads an XML file that specifies which tokenizers to use for this dictionary. |
void | clone(Dictionary toClone): Creates a deep copy of the given dictionary object and stores it in this dictionary. |
boolean | containsToken(java.lang.String token): Determines whether this dictionary contains a particular token. |
void | createFromFileLister(FileLister fileList): Creates this dictionary using an optional number or percent of the files in the given fileLister; if no count or percent is specified, uses the entire fileList to create the dictionary. |
FileLister | createFromFileLister(FileLister fileList, double parameter): Creates this dictionary using an optional number or percent of the files in the given fileLister. |
java.util.Iterator<java.lang.String> | dictionaryTokenIterator(): Provides an iterator that allows operations on the tokens in this dictionary. |
void | endTraining(): Ends training, which prohibits adding new tokens to this dictionary and allows this dictionary to be written to a file. |
java.lang.String | getCreatingProgram(): Provides the program that created this dictionary. |
java.lang.String | getCreation(): Provides the original creation date of this dictionary. |
java.lang.String | getDictionaryFilename(): Provides the filename of this dictionary, if this dictionary was loaded from or saved to a file. |
java.lang.String | getDictionaryName(): Provides the name of this dictionary. |
int | getFrequency(java.lang.String token): Provides the number of times the given token appeared when creating this dictionary. |
java.lang.String | getGUID(): Provides the unique identifier for this dictionary. |
double | getIDF(java.lang.String token): Provides the inverse document frequency (IDF) of the given token in this dictionary. |
java.lang.String | getLanguage(): Provides the language of this dictionary. |
double | getMaxIDF(): Provides the largest IDF value in this dictionary. |
double | getNormalizedIDF(java.lang.String token): Provides the normalized inverse document frequency (IDF) of the given token in this dictionary. |
int | getPosition(java.lang.String token): Provides the position of the given token in this dictionary. |
java.lang.String | getRandomToken(): Gives a random token from this dictionary. |
int | getSize(): Provides the number of tokens in this dictionary. |
java.lang.String | getSource(): Provides the directory used for this dictionary. |
java.lang.String | getSystemID(): Provides the identifier for the system that this dictionary was created on. |
java.util.List<java.lang.String> | getTokenizerNames(): Provides the names of the tokenizers used to create this dictionary. |
java.util.List<java.lang.String> | getTokens(): Provides a list of the tokens in this dictionary. |
int | getTotalDocuments(): Provides the number of documents processed for this dictionary. |
int | getVersion(): Provides the version number of this dictionary. |
void | loadDictionary(java.lang.String filename): Loads a previously serialized dictionary object into this dictionary. |
void | loadDictionaryXML(java.lang.String filename): Loads a previously serialized XML dictionary object into this dictionary. |
static void | main(java.lang.String[] args): Creates a dictionary from the specified directory using the provided tokenizers. |
void | makeDictionary(java.lang.String config_file, java.lang.String directory, java.lang.String output_file): Creates this dictionary from the given directory; reads an XML file that specifies which tokenizers to use and writes the dictionary to a file. |
protected void | mergeDictionary(java.util.Map<java.lang.String,DictionaryEntry> fileInput): Merges the given map of tokens to DictionaryEntry objects into this dictionary. |
void | saveDictionary(java.lang.String filename): Writes this dictionary as a serialized object. |
void | saveDictionaryXML(java.lang.String filename): Writes this dictionary as a serialized XML object. |
java.lang.String | setCreatingProgram(): Determines the name of the program that created this dictionary. |
void | setCreatingProgram(java.lang.String creatingProgram): Sets the name of the program that created this dictionary. |
void | setLanguage(java.lang.String language): Sets the language of this dictionary. |
void | setRandomNumberGenerator(java.util.Random random): Sets a random number generator to allow for repeatability. |
void | setTokenizers(TokenizerList tokenizerList): Loads the tokenizers to use for this dictionary from a TokenizerList object. |
void | setTokenizersByName(java.util.List<java.lang.String> tokenizerVec): Sets the tokenizers used by this dictionary; only works if no tokenizers have already been set or if the given list contains the same tokenizers as those that have already been set. |
void | showDictionary(): Prints information about this dictionary, including all tokens and their statistics. |
void | showStatistics(): Prints this dictionary's statistics, including: name; language; number of documents; number of tokens; whether this dictionary has been trimmed (including the IDF range, if trimmed); minimum and maximum IDF; list of tokenizers. |
void | showTokenizers(): Prints the names of the tokenizers used for this dictionary. |
void | showTokens(): Prints each token in this dictionary with its frequency, IDF, and normalized IDF. |
void | startTraining(java.util.List<java.lang.String> tokenizerVec): Starts training, which permits tokens to be added to this dictionary. |
void | threadingOff(): Turns off threading for creating this dictionary. |
void | threadingOn(): Turns on threading for creating this dictionary. |
void | threadingOn(int numThreads): Turns on threading for creating this dictionary. |
void | trimByIDF(double min, double max): Trims this dictionary by removing any token that is outside a range of normalized IDFs. |
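The trimByIDF entry describes removing tokens whose normalized IDF falls outside a range. As a rough illustration only, the sketch below applies that kind of range filter to a plain map from token to normalized IDF. The class name TrimSketch, the method name trimByNormalizedIdf, and the map layout are assumptions made for this example and are not taken from the Dictionary implementation; the bounds are treated as inclusive because only tokens with a lower or higher value are described as discarded.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the Dictionary class.
final class TrimSketch {
    // Returns a new map keeping only tokens whose normalized IDF lies in [min, max].
    static Map<String, Double> trimByNormalizedIdf(Map<String, Double> normalizedIdf,
                                                   double min, double max) {
        Map<String, Double> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : normalizedIdf.entrySet()) {
            double value = entry.getValue();
            if (value >= min && value <= max) {
                kept.put(entry.getKey(), value);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Double> idf = new LinkedHashMap<>();
        idf.put("the", 0.05);        // very common token, low normalized IDF
        idf.put("tokenizer", 0.60);  // mid-range token
        idf.put("xyzzy", 1.00);      // rarest token, highest normalized IDF
        System.out.println(trimByNormalizedIdf(idf, 0.1, 0.9).keySet()); // prints [tokenizer]
    }
}
```

Trimming both ends this way drops tokens that appear almost everywhere as well as tokens that appear almost nowhere, which is the usual motivation for a two-sided IDF range.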
public static final long serialVersionUID
protected java.util.Map<java.lang.String,DictionaryEntry> dictionary
protected int nextPosition
protected int totalDocuments
public Dictionary()
public Dictionary(java.lang.String dictionaryFilename)
dictionaryFilename - the filename containing a dictionary
public void addDocument(java.util.Iterator<java.lang.String> documentIterator)
documentIterator - the iterator over all tokens in the document
public void addTokenizers(java.lang.String config_file)
config_file - the string name of the XML configuration file containing this dictionary's tokenizers
public void clone(Dictionary toClone)
toClone - the dictionary object to copy
public boolean containsToken(java.lang.String token)
token - the string containing the token value
Returns: boolean value of whether the token is present in this dictionary
public void createFromFileLister(FileLister fileList)
fileList - the fileLister containing the files to use for this dictionary
public FileLister createFromFileLister(FileLister fileList, double parameter)
fileList - the fileLister containing the files to use for this dictionary
parameter - the double count (>0) or percent (<0) of the files in the fileLister to use to create this dictionary
public java.util.Iterator<java.lang.String> dictionaryTokenIterator()
public void endTraining()
public java.lang.String getCreatingProgram()
public java.lang.String getCreation()
public java.lang.String getDictionaryFilename()
public java.lang.String getDictionaryName()
public int getFrequency(java.lang.String token)
token - the string containing the token value
public java.lang.String getGUID()
public double getIDF(java.lang.String token)
token - the string containing the token value
Returns: double value of the token's inverse document frequency (IDF); returns 0.0 if the token is not present in this dictionary
public java.lang.String getLanguage()
public double getMaxIDF()
Returns: double value of the largest IDF in this dictionary
public double getNormalizedIDF(java.lang.String token)
token - the string containing the token value
Returns: double value of the token's normalized inverse document frequency (IDF); returns 0.0 if the token is not present in this dictionary
public int getPosition(java.lang.String token)
token - the string containing the token value
public java.lang.String getRandomToken()
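The IDF accessors above return 0.0 for a token that is not in the dictionary, but this page does not state the formula used. The sketch below shows one common convention, assuming IDF is the natural log of total documents divided by the number of documents containing the token, and that normalization divides by the largest IDF. The class IdfSketch and both formulas are assumptions for illustration; the Dictionary class may compute these values differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; the Dictionary class may compute IDF differently.
final class IdfSketch {
    private final Map<String, Integer> documentFrequency = new HashMap<>();
    private final int totalDocuments;

    IdfSketch(Map<String, Integer> documentFrequency, int totalDocuments) {
        this.documentFrequency.putAll(documentFrequency);
        this.totalDocuments = totalDocuments;
    }

    // Assumed formula: ln(totalDocuments / documentFrequency); 0.0 for unknown tokens.
    double getIDF(String token) {
        Integer df = documentFrequency.get(token);
        if (df == null || df == 0) {
            return 0.0;
        }
        return Math.log((double) totalDocuments / df);
    }

    // Largest IDF over all known tokens; 0.0 when there are none.
    double getMaxIDF() {
        double max = 0.0;
        for (String token : documentFrequency.keySet()) {
            max = Math.max(max, getIDF(token));
        }
        return max;
    }

    // Assumed normalization: divide by the largest IDF so values fall in [0, 1].
    double getNormalizedIDF(String token) {
        double max = getMaxIDF();
        return max == 0.0 ? 0.0 : getIDF(token) / max;
    }

    public static void main(String[] args) {
        Map<String, Integer> df = new HashMap<>();
        df.put("the", 100);
        df.put("tokenizer", 10);
        df.put("xyzzy", 1);
        IdfSketch sketch = new IdfSketch(df, 100);
        System.out.println(sketch.getIDF("the"));             // 0.0: appears in every document
        System.out.println(sketch.getNormalizedIDF("xyzzy")); // 1.0: rarest token
        System.out.println(sketch.getIDF("missing"));         // 0.0: not in the dictionary
    }
}
```

Note that under this assumed formula a token present in every document also has an IDF of 0.0, so a zero result alone does not distinguish it from a missing token; containsToken answers that question directly.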
public int getSize()
public java.lang.String getSource()
public java.lang.String getSystemID()
public java.util.List<java.lang.String> getTokenizerNames()
public java.util.List<java.lang.String> getTokens()
public int getTotalDocuments()
public int getVersion()
public void loadDictionary(java.lang.String filename)
filename - the string filename of the dictionary file to load
public void loadDictionaryXML(java.lang.String filename)
filename - the string filename of the XML dictionary file to load
public void makeDictionary(java.lang.String config_file, java.lang.String directory, java.lang.String output_file)
config_file - the string name of the XML configuration file containing dictionary tokenizers
directory - the string path to the directory used to create this dictionary
output_file - the string name/path to write this dictionary to
protected void mergeDictionary(java.util.Map<java.lang.String,DictionaryEntry> fileInput)
fileInput - the map containing the dictionary information to merge with this dictionary
public void saveDictionary(java.lang.String filename)
filename - the string filename to write this dictionary to
public void saveDictionaryXML(java.lang.String filename)
filename - the string XML filename to write this dictionary to
public java.lang.String setCreatingProgram()
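Because Dictionary implements java.io.Serializable, saveDictionary and loadDictionary presumably rely on standard Java object serialization. The sketch below shows that round trip with a small stand-in class and an in-memory byte array instead of a file. ToyDictionary, save, and load are hypothetical names for this example; how the real class writes its file is not shown on this page.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a serialization round trip; not the Dictionary implementation.
final class SerializationSketch {
    // Stand-in for a serializable dictionary with a fixed serialVersionUID.
    static final class ToyDictionary implements Serializable {
        private static final long serialVersionUID = 1L;
        final Map<String, Integer> frequencies = new HashMap<>();
        int totalDocuments;
    }

    // Writes the object graph to a byte array (a file stream would work the same way).
    static byte[] save(ToyDictionary dictionary) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(dictionary);
        }
        return bytes.toByteArray();
    }

    // Reads the object graph back and casts it to the expected type.
    static ToyDictionary load(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (ToyDictionary) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ToyDictionary original = new ToyDictionary();
        original.frequencies.put("tokenizer", 3);
        original.totalDocuments = 2;

        ToyDictionary copy = load(save(original));
        System.out.println(copy.frequencies.get("tokenizer")); // 3
        System.out.println(copy.totalDocuments);               // 2
    }
}
```

Declaring serialVersionUID explicitly, as the Dictionary class does, keeps previously saved files readable across compatible changes to the class.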
public void setCreatingProgram(java.lang.String creatingProgram)
creatingProgram - the string program that created this dictionary
public void setLanguage(java.lang.String language)
language - the string language of the files used to create this dictionary
public void setRandomNumberGenerator(java.util.Random random)
random - the random object to use for this dictionary
public void setTokenizers(TokenizerList tokenizerList)
tokenizerList - the TokenizerList object containing the tokenizers to use for this dictionary
public void setTokenizersByName(java.util.List<java.lang.String> tokenizerVec)
tokenizerVec - the list of tokenizer names to use to create this dictionary
public void showDictionary()
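setRandomNumberGenerator is documented as enabling repeatability, which in practice means supplying a java.util.Random built from a fixed seed. The sketch below shows why that works: two generators created with the same seed yield the same sequence of picks. RandomTokenSketch and its internals are hypothetical and only mirror the method names above; they are not the Dictionary implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of seeded random selection; not the Dictionary implementation.
final class RandomTokenSketch {
    private final List<String> tokens;
    private Random random = new Random(); // unseeded by default, so picks vary per run

    RandomTokenSketch(List<String> tokens) {
        this.tokens = new ArrayList<>(tokens);
    }

    // Swapping in a seeded generator makes the pick sequence reproducible.
    void setRandomNumberGenerator(Random random) {
        this.random = random;
    }

    // Picks a token uniformly at random; assumes the token list is non-empty.
    String getRandomToken() {
        return tokens.get(random.nextInt(tokens.size()));
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("alpha", "beta", "gamma", "delta");
        RandomTokenSketch first = new RandomTokenSketch(tokens);
        RandomTokenSketch second = new RandomTokenSketch(tokens);
        first.setRandomNumberGenerator(new Random(42L));
        second.setRandomNumberGenerator(new Random(42L));
        for (int i = 0; i < 5; i++) {
            // Same seed, same sequence of picks.
            System.out.println(first.getRandomToken().equals(second.getRandomToken())); // true
        }
    }
}
```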
public void showStatistics()
public void showTokenizers()
public void showTokens()
public void startTraining(java.util.List<java.lang.String> tokenizerVec)
tokenizerVec - the string list containing tokenizer names; must match the tokenizers used to create this dictionary
public void threadingOff()
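The startTraining, addDocument, and endTraining methods together suggest a simple lifecycle: open the dictionary for new tokens, feed it documents as token iterators, then close it. The toy class below sketches that flow. TrainingSketch is hypothetical, it omits tokenizers entirely, and rejecting addDocument outside training with an exception is an assumption of this example; the real class may handle that case differently.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical illustration of a training lifecycle; not the Dictionary implementation.
final class TrainingSketch {
    private final Map<String, Integer> frequency = new HashMap<>();
    private int totalDocuments = 0;
    private boolean training = false;

    void startTraining() {
        training = true;
    }

    void endTraining() {
        training = false;
    }

    // Counts every token occurrence in one document (assumed behavior).
    void addDocument(Iterator<String> documentIterator) {
        if (!training) {
            // Assumption: this sketch rejects documents outside training.
            throw new IllegalStateException("dictionary is not in training mode");
        }
        while (documentIterator.hasNext()) {
            frequency.merge(documentIterator.next(), 1, Integer::sum);
        }
        totalDocuments++;
    }

    int getFrequency(String token) {
        return frequency.getOrDefault(token, 0);
    }

    int getTotalDocuments() {
        return totalDocuments;
    }

    int getSize() {
        return frequency.size();
    }

    public static void main(String[] args) {
        TrainingSketch dictionary = new TrainingSketch();
        dictionary.startTraining();
        dictionary.addDocument(Arrays.asList("a", "b", "a").iterator());
        dictionary.addDocument(Arrays.asList("b", "c").iterator());
        dictionary.endTraining();
        System.out.println(dictionary.getFrequency("a"));   // 2
        System.out.println(dictionary.getTotalDocuments()); // 2
        System.out.println(dictionary.getSize());           // 3
    }
}
```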
public void threadingOn()
public void threadingOn(int numThreads)
numThreads - the number of threads to use to create this dictionary
public void trimByIDF(double min, double max)
min - the double minimum normalized IDF to keep; any token with a lower IDF will be discarded.
max - the double maximum normalized IDF to keep; any token with a higher IDF will be discarded.
public static void main(java.lang.String[] args)
args - array of string command line arguments
args[0] - the filename of the tokenizer configuration file
args[1] - the directory to create the dictionary from
args[2] - the filename to write the dictionary to