Author: Nathan Schneider, adapted from Richard Johansson
In this assignment you will implement a bigram HMM for English part-of-speech tagging.
Starter code: tagger.py
Data: the files en-ud-{train,dev,test}.{upos,ppos}.tsv (see explanation in README.txt)
These files use a simple tab-separated text format. The first column contains the words, and the second the part-of-speech tags. Sentences are separated by blank lines. Here is an example fragment of en-ud-train.upos.tsv:
(The first word is actually a contraction, so the first tag is incorrect. Even "gold" data isn't perfect!)
The starter code contains code to load the tagged corpus. The output of read_tagged_corpus()
is a list of tagged sentences. Each tagged sentence is a list of word–tag pairs:
The training and test sets are loaded separately. (You do not need to use the dev set for this assignment.) No lowercasing or other normalization is performed on words.
The simplest tagger that can be learned from the training data is a most frequent baseline tagger: for each word in the test set, it outputs the most frequent tag observed with that word in the training corpus, ignoring context (hence, it is a unigram tagger). For previously unseen words, it outputs the tag that is most frequent in general.
Implement learn()
as indicated in comments to compute frequencies in the training data and store them in allTagCounts
and perWordTagCounts
. Implement baseline_tag_sentence()
to use those frequencies to make a prediction for a test sentence.
[9 pts]
You will now implement the bigram HMM tagger. We start with the easy part: the estimation of the transition and emission probabilities.
Complete the rest of learn()
to store these in the designated data structures.
Store them as log probabilities using the log()
function from the math
module (which computes the natural log by default).
Each of the distributions should use add-α smoothing (ALPHA=.1
) and include an unknown entry using the constant UNK
.
The constants START
and END
are provided for start-of-sentence and end-of-sentence symbols, respectively.
[9 pts]
Implement joint_prob()
to calculate the joint log probability of the provided sentence's words and tags according to the learned transition and emission parameters. This will be called for both gold and predicted taggings of each test sentence.
[9 pts]
To tag a sentence, you need to apply the Viterbi algorithm, and then retrace your steps back to the initial dummy item.
hmm_tag_sentence()
is the method that orchestrates the tagging of a sentence using the Viterbi algorithm and returns the tagged words. There are 3 helper functions you should use: viterbi()
, which creates and completes the Viterbi chart; find_best_item()
, which computes the value and backpointer for a single chart cell; and retrace()
, which follows backpointers in the completed chart to reconstruct the best tagging of the sentence.
Implement these methods according to the comments.
Hint: When you determine the possible tags for a word, for efficiency reasons you should consider only the tags that have been observed in the training corpus for that word. For a word that was unseen in training, consider all tags.
Clarification: What the pseudocode refers to as an "item" should be implemented as a tuple consisting of 1) the tag at that position; 2) a reference to a preceding item; 3) the log probability of the sequence. The item thus represents the chart cell, its backpointer, and its value.
[15 pts]
Finally, complete the implementation of count_correct()
to compare a sentence's gold tagging with a prediction and return quantities that will be used to compute accuracy.
[3 pts]
Below count_correct()
, the starter code provides the paths to the training and test sets (for Universal tags; you will have to change this for experiments with Penn tags).
Then there is code that loads the data, calls the functions you have implemented above, and prints each test sentence with gold and predicted taggings, followed by summary statistics.
The code can be run as
which evaluates the baseline tagger, or
which evaluates the HMM tagger.
Each test sentence will be printed twice: first with gold tags, then with predicted tags. At the beginning of the line is the HMM model's joint log probability. At the end of the line with the predicted tags is the accuracy for that sentence.
For example, in the reference implementation, the following is printed for the last test sentence when running the baseline tagger:
When running the HMM tagger, the result is:
If you are using a terminal which supports ANSI color-coding, incorrect predicted tags and their gold counterparts will be colored.
Answer these in writeup.txt and submit them in a zip file along with your code. [15 pts]
What accuracies do the baseline and HMM taggers achieve on the test set? Report accuracies for (i) all tokens, (ii) unseen (OOV) tokens, (iii) perfect tagging of full sentences. (Code is already there to display this information for you.) [2 pts]
Which word types are most frequently tagged incorrectly by the HMM, and why? (You will need to write some additional code to answer this.) [5 pts]
With the HMM tagger, for how many sentences does the gold tagging have higher probability than the Viterbi tagging? How is this possible if the Viterbi algorithm is supposed to return the most probable tagging under the model? Hint: Uncomment the assertion on line 272 so the program will terminate after printing the first sentence with this property. You may want to inspect some of the probabilities pertaining to this sentence. [2 pts]
Now run the baseline and HMM taggers on the Penn-tagged data (you will need to change the paths in TRAIN_DATA
and TEST_DATA
). What are the overall token accuracies? Are they higher or lower than with the Universal tagset? Why do you think this is the case? [2 pts]
How much time does it take for training and predicting with each of the taggers, in both tagsets? (Code is already there to display this information for you.) Does training time or prediction time dominate, and why? [2 pts]
If there was significantly more TRAINING data, would you expect a noticeable impact on prediction runtime? Why or why not? [2 pts]