Primary author: Harry Eldridge
You will need:
from nltk.corpus import wordnet
(usage can be found here, along with information on similarity measures)
from nltk.corpus import semcor
(usage, source code)

This assignment has multiple parts.
Implement a Python program called random_substitution.py that, given an integer index (specified by a command line argument), takes the sentence at that index in the SemCor corpus, prints it, substitutes each sense-tagged chunk with a random synonym, hypernym, or hyponym, and then prints the resulting sentence. Whether the program substitutes synonyms, hypernyms, or hyponyms should be specified by the -nym command line argument.
Example: To substitute on the sentence at index 12 using hypernyms, you would run: python random_substitution.py 12 -nym hypernym
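For reference, here is a minimal sketch of one way this could look, assuming the chunks are read directly through NLTK (the provided semcor_chunk.py helper may make the chunk handling cleaner) and that the random number generator is seeded for reproducibility:

# random_substitution.py -- a minimal sketch, not a reference implementation
import random
import sys

from nltk.corpus import semcor

random.seed(0)  # fixed seed so runs are reproducible while debugging

def random_nym(synset, nym):
    """Return a random lemma name from the synset's synonyms, hypernyms,
    or hyponyms, or None if there are no candidates."""
    if nym == "synonym":
        candidates = synset.lemma_names()
    elif nym == "hypernym":
        candidates = [l for s in synset.hypernyms() for l in s.lemma_names()]
    else:  # "hyponym"
        candidates = [l for s in synset.hyponyms() for l in s.lemma_names()]
    return random.choice(candidates).replace("_", " ") if candidates else None

if __name__ == "__main__":
    index, nym = int(sys.argv[1]), sys.argv[3]   # e.g. 12 -nym hypernym
    sentence = semcor.tagged_sents(tag="sem")[index]
    print(" ".join(semcor.sents()[index]))       # the original sentence
    out = []
    for chunk in sentence:
        words = chunk.leaves() if hasattr(chunk, "leaves") else list(chunk)
        label = chunk.label() if hasattr(chunk, "label") else None
        if label is not None and hasattr(label, "synset"):   # sense-tagged chunk
            replacement = random_nym(label.synset(), nym)
            out.append(replacement if replacement else " ".join(words))
        else:                                                # untagged chunk
            out.append(" ".join(words))
    print(" ".join(out))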
Attach a short write-up, writeup.txt, analyzing your results. What happens to the meaning and naturalness of a sentence if sense-tagged words are replaced with random synonyms? hyponyms? hypernyms?
Hint: It may be a good idea to seed your random number generator to simplify debugging.
Edit edit_distance.py so that, given two integers specified by command line arguments, the program computes the Levenshtein edit distance between the two untagged sentences in the SemCor corpus at those indices.
The distance should be computed over full words: each sentence should be a list of tokens rather than a string. Use semcor.sents() without any additional preprocessing on the tokens. If two token strings do not match exactly, they are considered completely different symbols.
Your script should print 4 lines: the two sentences (as space-separated text), the computed distance, and the minimum edit sequence with individual costs.
Example: To compare the sentences at indices 12 and 24 you would run: python edit_distance.py 12 24
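A minimal sketch of the token-level dynamic program (backtracking to recover the edit sequence and its individual costs is omitted here):

# edit_distance.py -- sketch of the core DP over token lists
import sys
from nltk.corpus import semcor

def edit_distance(a, b):
    """Classic Levenshtein DP over two token lists; every operation costs 1."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[m][n]

if __name__ == "__main__":
    i, j = int(sys.argv[1]), int(sys.argv[2])
    s1, s2 = semcor.sents()[i], semcor.sents()[j]
    print(" ".join(s1))
    print(" ".join(s2))
    print(edit_distance(s1, s2))
    # the minimum edit sequence would be recovered by backtracking through d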
Copy your program from part 2, save it as wordnet_edit_distance.py, and edit it so that, instead of using a static substitution cost of 1, it uses a WordNet similarity measure to compute the substitution cost.
WordNet similarity measures analyze the relationship between two synsets in the hierarchy to quantify their similarity. Note that they are similarity measures (higher = more similar), not distance or cost measures, so use the formula cost = 1 - similarity.
Your script should have the option to use either path similarity or Wu-Palmer similarity. Both map to [0, 1], where the similarity between a synset and itself should be 1 (though, as of this writing, the implementation is slightly buggy). This option should be specified by the -sim command line argument: -sim path or -sim wup.
If one or both tokens involved in the candidate substitution do not have a synset annotation, or their similarity computation returns None, use 1 as the substitution cost.
Whereas the previous program operated directly over word tokens in the sentence, this program should operate over SemCor chunks.
Example: To compare the sentences at indices 12 and 24 using Wu-Palmer similarity you would run: python wordnet_edit_distance.py 12 24 -sim wup
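The only change from part 2 is the substitution cost. A sketch of what that cost function might look like, assuming the two synsets have already been extracted from the chunks being aligned (e.g. via the semcor_chunk.py helper):

from nltk.corpus import wordnet

def substitution_cost(syn1, syn2, sim="path"):
    """Cost = 1 - similarity; fall back to 1 when a synset or a score is missing."""
    if syn1 is None or syn2 is None:
        return 1.0
    score = syn1.wup_similarity(syn2) if sim == "wup" else syn1.path_similarity(syn2)
    return 1.0 if score is None else 1.0 - score

# e.g. substitution_cost(wordnet.synset("dog.n.01"), wordnet.synset("cat.n.01"), sim="wup")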
In writeup.txt, describe the results. Does this seem like a better approach than regular edit distance to computing whether sentences are similar or different? Do the two WordNet similarity algorithms produce different results, and if so, which seems better? What are some of the limitations of this approach to measuring sentence similarity?
Copy your program from part 3, save it as wordnet_edit_distance2.py, and edit it so that it compares one untagged and one tagged SemCor sentence. When computing the substitution cost between words, find the synset of the untagged word that is most similar to the tagged one, and compute the substitution cost from that similarity as in part 3. For this part, assume neither sentence contains multiword chunks.
Example: To compare the untagged sentence at index 12 and the tagged sentence at index 24 using path similarity you would run: python wordnet_edit_distance2.py 12 24 -sim path
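A sketch of the modified cost, assuming the tagged side's synset has already been extracted and the untagged side is a plain word string:

from nltk.corpus import wordnet

def best_match_cost(untagged_word, tagged_synset, sim="path"):
    """Try every synset of the untagged word against the tagged synset,
    keep the highest similarity, and convert it to a cost as in part 3."""
    if tagged_synset is None:
        return 1.0
    best = None
    for candidate in wordnet.synsets(untagged_word):
        score = (candidate.wup_similarity(tagged_synset) if sim == "wup"
                 else candidate.path_similarity(tagged_synset))
        if score is not None and (best is None or score > best):
            best = score
    return 1.0 if best is None else 1.0 - best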
In writeup.txt, briefly note your impression of the results: How does the lack of gold sense tags for one of the sentences affect the score relative to part 3?
A brief SemCor explanation: SemCor is a corpus of sentences split into word sequences or "chunks." Each chunk is either unlabeled or is tagged with its WordNet lemma, which maps to a synset.
You should use the helper class in semcor_chunk.py to more easily extract the relevant information from the chunks. It works on the chunks in the sentences returned by calls to semcor.tagged_sents(tag='sem').
If you only want the words of a sentence, ignoring the chunks and tags (parts 2 and 4), use semcor.sents().
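If you want to look at the raw chunks without the helper, the NLTK access pattern looks roughly like this (sense-tagged chunks are Tree objects whose label is a WordNet Lemma; a few tagged chunks carry plain-string labels instead):

from nltk.corpus import semcor

sent = semcor.tagged_sents(tag="sem")[12]   # one sentence as a list of chunks
for chunk in sent:
    if hasattr(chunk, "label") and hasattr(chunk.label(), "synset"):
        # sense-tagged chunk: the label is a WordNet Lemma, the leaves are the words
        print(chunk.leaves(), chunk.label().synset())
    else:
        # untagged chunk (a plain list of words) or a chunk with a non-Lemma label
        words = chunk.leaves() if hasattr(chunk, "leaves") else list(chunk)
        print(words, None)

print(semcor.sents()[12])   # the same sentence as a flat, untagged token list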
The following groups of short sentences are close in construction, making them good examples to test on: