public class Dictionary extends java.lang.Object implements java.io.Serializable, java.lang.Iterable<java.util.Map.Entry<Token,DictionaryEntry>>
Constructor and Description |
---|
Dictionary()
Constructor that sets default values for this dictionary
|
Dictionary(java.lang.String dictionaryFilename)
Constructor that loads this dictionary from a file
|
Modifier and Type | Method and Description |
---|---|
void |
clone(Dictionary anotherDictionary)
Creates a deep copy of the given dictionary object and stores it in this dictionary
|
boolean |
containsToken(Token token)
Determines if this dictionary contains a particular token
|
void |
createPartialDictionary(java.util.List<java.io.File> dictionaryFiles,
java.io.File directory)
Creates this dictionary using the optional count or percent in the given fileLister; if no count or percent is
specified, the entire fileLister is used to create the dictionary
|
java.lang.String |
getCreatingProgram()
Provides the program that created this dictionary
|
java.lang.String |
getCreation()
Provides the original creation date of this dictionary
|
java.lang.String |
getDictionaryFilename()
Provides the filename of this dictionary, if this dictionary was loaded from or saved to a file
|
java.lang.String |
getDictionaryName()
Provides the name of this dictionary
|
java.lang.String |
getGUID()
Provides the unique identifier for this dictionary
|
double |
getMaxIDF()
Provides the largest IDF value in this dictionary
|
double |
getMaxIDFFound()
Provides the largest IDF value in this dictionary
|
int |
getMaxThread()
Provides the maximum thread count to use for this dictionary
|
double |
getMinIDF()
Provides the smallest IDF value in this dictionary
|
double |
getNormalizedIDF(DictionaryEntry entry)
Provides the normalized inverse document frequency (IDF) of the given dictionaryEntry.
|
double |
getNormalizedIDF(Token token)
Provides the normalized inverse document frequency (IDF) of the given token in this dictionary.
|
int |
getPosition(Token token)
Provides the position of the given token in this dictionary
|
java.lang.String |
getSource()
Provides the directory used for this dictionary
|
java.lang.String |
getStatistics()
Provides this dictionary's statistics, including:
Name
Number of documents
Number of tokens
Whether this dictionary has been trimmed (including the IDF range, if trimmed)
Minimum and maximum IDF
List of tokenizers
|
java.lang.String |
getSystemID()
Provides the identifier for the system that this dictionary was created on
|
java.util.List<java.lang.String> |
getTokenizerNames()
Provides the names of the tokenizers
|
TokenizerList |
getTokenizers()
Provides the tokenizerList object used to create this dictionary
|
java.util.List<Token> |
getTokens()
Provides a list of the tokens in this dictionary
|
int |
getTotalDocuments()
Provides the number of documents processed for this dictionary
|
java.util.Iterator<java.util.Map.Entry<Token,DictionaryEntry>> |
iterator()
Allows for operations on the tokens in this dictionary
|
void |
makeDictionary(java.util.List<java.lang.String> tokenizersNames,
java.lang.String path,
double min_idf,
double max_idf,
java.lang.String output_file)
Creates dictionary from the given directory using the given list of tokenizer names, trims it to the given range of
normalized IDFs, and outputs the trimmed dictionary to a file.
|
void |
makeDictionary(java.lang.String config_file,
java.lang.String path,
java.lang.String output_file)
Creates dictionary from the given directory; reads an XML file that specifies which tokenizers to use and outputs
this dictionary to a file
|
protected void |
mergeDictionary(java.util.Map<Token,DictionaryEntry> documentTokens)
Merges the given Map of tokens to dictionaryEntry with this dictionary
|
void |
saveDictionaryXML(java.lang.String filename)
Writes this dictionary as a serialized XML object.
|
void |
setCreatingProgram(java.lang.String creatingProgram)
Sets the name of the program that created this dictionary
|
void |
setTokenizers(java.util.List<java.lang.String> tokenizerVec)
Sets the tokenizers used by this dictionary; only works if no tokenizers have already been set or if the given list
contains the same tokenizers as those that have already been set
|
void |
setTokenizers(TokenizerList tokenizerList)
Loads the tokenizers to use for this dictionary from a TokenizerList object.
|
int |
size()
Provides the number of tokens in this dictionary
|
void |
stdOutDictionaryXML()
Writes dictionary to standard output
|
void |
threadingOn(int numThreads)
Turns on threading for creating this dictionary.
|
void |
trimByIDF(double min,
double max)
Trims this dictionary by removing any token that is outside a range of normalized IDFs.
|
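The summary above refers to a normalized inverse document frequency (IDF) without stating how it is computed. As a minimal sketch, assuming the common definition ln(N / df) divided by its largest possible value ln(N), the calculation could look like the following. The class `NormalizedIdfSketch`, its fields and the formula itself are illustrative assumptions, not this library's implementation; only the 0.0 result for an absent token is taken from the `getNormalizedIDF(Token)` documentation.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in used for illustration; not the documented Dictionary class. */
public class NormalizedIdfSketch {
    // Token text -> number of documents that contain the token (document frequency).
    private final Map<String, Integer> documentFrequency = new HashMap<>();
    private final int totalDocuments;

    public NormalizedIdfSketch(int totalDocuments) {
        this.totalDocuments = totalDocuments;
    }

    public void put(String token, int df) {
        documentFrequency.put(token, df);
    }

    /**
     * Assumed formula: ln(N / df) / ln(N). For 1 <= df <= N this lies in [0, 1].
     * Returns 0.0 when the token is absent, mirroring the documented behavior.
     */
    public double normalizedIdf(String token) {
        Integer df = documentFrequency.get(token);
        if (df == null || df <= 0 || totalDocuments <= 1) {
            return 0.0;
        }
        return Math.log((double) totalDocuments / df) / Math.log(totalDocuments);
    }

    public static void main(String[] args) {
        NormalizedIdfSketch d = new NormalizedIdfSketch(100);
        d.put("the", 100);  // appears in every document
        d.put("mid", 10);   // appears in a tenth of the documents
        d.put("rare", 1);   // appears in a single document
        for (String t : new String[] {"the", "mid", "rare", "missing"}) {
            System.out.println(t + " -> " + d.normalizedIdf(t));
        }
    }
}
```

Under this assumed formula a token present in every document scores 0.0 and a token present in exactly one document scores 1.0, which is the kind of range that `trimByIDF(min, max)` bounds would be compared against.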
public Dictionary()
public Dictionary(java.lang.String dictionaryFilename)
dictionaryFilename - the filename containing a dictionary
public void clone(Dictionary anotherDictionary)
anotherDictionary - the dictionary object to copy
public boolean containsToken(Token token)
token - the string containing the token value
Returns: boolean value of whether the token is present in this dictionary
public void createPartialDictionary(java.util.List<java.io.File> dictionaryFiles, java.io.File directory)
fileLister the fileLister containing the files to use for this dictionary, which needs to have completed loading
dictionaryFiles - a list of files that will be used to create the dictionary
directory - the directory used to create the dictionary
public java.util.Iterator<java.util.Map.Entry<Token,DictionaryEntry>> iterator()
Specified by: iterator in interface java.lang.Iterable<java.util.Map.Entry<Token,DictionaryEntry>>
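Because the class implements `java.lang.Iterable<java.util.Map.Entry<Token,DictionaryEntry>>`, an instance can be walked with an enhanced for loop. The sketch below shows that pattern on a hypothetical stand-in, `TinyDictionary`, which uses `String` and `Integer` in place of `Token` and `DictionaryEntry`, since the members of those two types are not documented on this page.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

public class IterationSketch {
    /** Hypothetical stand-in for a map-backed dictionary; not the documented class. */
    public static class TinyDictionary implements Iterable<Map.Entry<String, Integer>> {
        private final Map<String, Integer> entries = new LinkedHashMap<>();

        public void add(String token, int documentCount) {
            entries.put(token, documentCount);
        }

        @Override
        public Iterator<Map.Entry<String, Integer>> iterator() {
            // Delegates to the backing map's entry set.
            return entries.entrySet().iterator();
        }
    }

    /** Sums the entry values by walking the dictionary with an enhanced for loop. */
    public static int totalCount(Iterable<Map.Entry<String, Integer>> dictionary) {
        int total = 0;
        for (Map.Entry<String, Integer> entry : dictionary) {
            total += entry.getValue();
        }
        return total;
    }

    public static void main(String[] args) {
        TinyDictionary d = new TinyDictionary();
        d.add("alpha", 3);
        d.add("beta", 5);
        for (Map.Entry<String, Integer> entry : d) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
        System.out.println("total = " + totalCount(d));
    }
}
```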
public java.lang.String getCreatingProgram()
public int getMaxThread()
public java.lang.String getCreation()
public java.lang.String getDictionaryFilename()
public java.lang.String getDictionaryName()
public java.lang.String getGUID()
public double getMaxIDFFound()
Returns: double value of the largest IDF in this dictionary
public double getMaxIDF()
Returns: double value of the largest IDF in this dictionary
public double getMinIDF()
Returns: double value of the smallest IDF in this dictionary
public double getNormalizedIDF(Token token)
token - the string containing the token value
Returns: double value of the token's normalized inverse document frequency (IDF); returns 0.0 if the token is not present in this dictionary
public double getNormalizedIDF(DictionaryEntry entry)
entry - the dictionaryEntry for which to calculate the normalized IDF
Returns: double value of the dictionaryEntry's normalized inverse document frequency (IDF)
public int getPosition(Token token)
token - the string containing the token value
public int size()
public java.lang.String getSource()
public java.lang.String getSystemID()
public TokenizerList getTokenizers()
public java.util.List<java.lang.String> getTokenizerNames()
public java.util.List<Token> getTokens()
public int getTotalDocuments()
public void makeDictionary(java.lang.String config_file, java.lang.String path, java.lang.String output_file)
config_file - the string name of the XML configuration file containing dictionary tokenizers
path - the string path to the directory used to create this dictionary
output_file - the string name/path to write this dictionary to
public void makeDictionary(java.util.List<java.lang.String> tokenizersNames, java.lang.String path, double min_idf, double max_idf, java.lang.String output_file)
tokenizersNames - the list of names of the dictionary tokenizers
path - the string path to the directory used to create this dictionary
min_idf - the double minimum normalized inverse document frequency (IDF) to trim dictionary
max_idf - the double maximum normalized inverse document frequency (IDF) to trim dictionary
output_file - the string name/path to write this dictionary to
protected void mergeDictionary(java.util.Map<Token,DictionaryEntry> documentTokens)
documentTokens - the map containing the dictionary information to merge with this dictionary
public void saveDictionaryXML(java.lang.String filename)
filename - the string XML filename to write this dictionary to
public void stdOutDictionaryXML()
public void setCreatingProgram(java.lang.String creatingProgram)
creatingProgram - the string program that created this dictionary
public void setTokenizers(TokenizerList tokenizerList)
tokenizerList - the TokenizerList object containing the tokenizers to use for this dictionary
public void setTokenizers(java.util.List<java.lang.String> tokenizerVec)
tokenizerVec - the list of tokenizer names to use to create this dictionary
public java.lang.String getStatistics()
public void threadingOn(int numThreads)
numThreads - the number of threads to use to create this dictionary
public void trimByIDF(double min, double max)
min - the double minimum normalized IDF to keep; any token with a lower IDF will be discarded.
max - the double maximum normalized IDF to keep; any token with a higher IDF will be discarded.
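
The trimming rule described for `trimByIDF` (discard any token whose normalized IDF is lower than `min` or higher than `max`) can be sketched on a plain map. The class `TrimByIdfSketch` and the `Map<String, Double>` representation are assumptions made for illustration; the documented class stores `Token` and `DictionaryEntry` objects, whose internals are not shown on this page.

```java
import java.util.HashMap;
import java.util.Map;

public class TrimByIdfSketch {
    /**
     * Removes every entry whose normalized IDF is below min or above max.
     * Values equal to either bound are kept, matching "lower" and "higher" in the description.
     */
    public static void trimByIdf(Map<String, Double> normalizedIdfByToken, double min, double max) {
        normalizedIdfByToken.values().removeIf(idf -> idf < min || idf > max);
    }

    public static void main(String[] args) {
        Map<String, Double> idf = new HashMap<>();
        idf.put("the", 0.01);      // too common: below min
        idf.put("model", 0.45);    // inside the range
        idf.put("zyzzyva", 0.99);  // too rare: above max
        trimByIdf(idf, 0.05, 0.95);
        System.out.println(idf.keySet()); // prints [model]
    }
}
```

Removing through the map's `values()` view keeps the sketch short; an implementation that must also update positions or statistics would likely need an explicit loop instead.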