A3: HMM for POS Tagging

Author: Nathan Schneider, adapted from Richard Johansson

In this assignment you will implement a bigram HMM for English part-of-speech tagging.

Starter code: tagger.py

Data: the files en-ud-{train,dev,test}.{upos,ppos}.tsv (see explanation in README.txt)

Everything as a zip file

0. Reading the tagged data

These files use a simple tab-separated text format. The first column contains the words, and the second the part-of-speech tags. Sentences are separated by blank lines. Here is an example fragment of en-ud-train.upos.tsv:

Im    PROPN
moving    VERB
from    ADP
South    PROPN
Carolina    PROPN
so    ADV
this    PRON
will    AUX
already    ADV
be    AUX
a    DET
huge    ADJ
culture    NOUN
shock    NOUN
.    PUNCT
 
Any    DET
suggestions    NOUN
?    PUNCT

(The first word is actually a contraction, so the first tag is incorrect. Even "gold" data isn't perfect!)

The starter code already includes a function to load the tagged corpus. The output of read_tagged_corpus() is a list of tagged sentences. Each tagged sentence is a list of word–tag pairs:

[[('Im', 'PROPN'), ('moving', 'VERB'), ('from', 'ADP'), 
  ('South', 'PROPN'), ('Carolina', 'PROPN'), ('so', 'ADV'), 
  ('this', 'PRON'), ('will', 'AUX'), ('already', 'ADV'), 
  ('be', 'AUX'), ('a', 'DET'), ('huge', 'ADJ'), 
  ('culture', 'NOUN'), ('shock', 'NOUN'), ('.', 'PUNCT')], 
 [('Any', 'DET'), ('suggestions', 'NOUN'), ('?', 'PUNCT')]]

The training and test sets are loaded separately. (You do not need to use the dev set for this assignment.) No lowercasing or other normalization is performed on words.
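
For concreteness, a loader for this format might look roughly like the sketch below. This is only an illustration of the file format; read_tagged_corpus() in the starter code may differ in its details.

def read_tagged_corpus_sketch(path):
    # Illustrative sketch (not the starter code): read a tab-separated word/tag file
    # into a list of tagged sentences, each a list of (word, tag) pairs.
    sentences = []
    current = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line.strip():            # blank line marks a sentence boundary
                if current:
                    sentences.append(current)
                    current = []
            else:
                word, tag = line.split('\t')
                current.append((word, tag))
    if current:                             # final sentence if there is no trailing blank line
        sentences.append(current)
    return sentences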

1. Baseline unigram tagger

The simplest tagger that can be learned from the training data is a most-frequent-tag baseline: for each word in the test set, it outputs the tag most frequently observed with that word in the training corpus, ignoring context (hence, it is a unigram tagger). For previously unseen words, it outputs the tag that is most frequent overall in the training data.

Implement learn() as indicated in comments to compute frequencies in the training data and store them in allTagCounts and perWordTagCounts. Implement baseline_tag_sentence() to use those frequencies to make a prediction for a test sentence.
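
To make the data structures concrete, here is a rough sketch of the counting step and the baseline prediction. It assumes allTagCounts is a Counter over tags, perWordTagCounts maps each word to a Counter of its tags, and the baseline function receives a plain list of words; the comments in the starter code are authoritative if they differ from this sketch.

from collections import Counter

allTagCounts = Counter()     # tag -> count over the whole training corpus
perWordTagCounts = {}        # word -> Counter of tags observed with that word

def learn_counts_sketch(tagged_sentences):
    # Sketch of the counting part of learn()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            allTagCounts[tag] += 1
            perWordTagCounts.setdefault(word, Counter())[tag] += 1

def baseline_tag_sentence_sketch(words):
    # Sketch: each word gets its most frequent training tag;
    # unseen words get the overall most frequent tag.
    default_tag = allTagCounts.most_common(1)[0][0]
    return [(w, perWordTagCounts[w].most_common(1)[0][0]) if w in perWordTagCounts
            else (w, default_tag)
            for w in words]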

[9 pts]

2. Estimating the HMM parameters

You will now implement the bigram HMM tagger. We start with the easy part: the estimation of the transition and emission probabilities. Complete the rest of learn() to store these in the designated data structures. Store them as log probabilities using the log() function from the math module (which computes the natural log by default). Each of the distributions should use add-α smoothing (ALPHA=.1) and include an unknown entry using the constant UNK. The constants START and END are provided for start-of-sentence and end-of-sentence symbols, respectively.
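
As a reminder, the add-α estimate for either distribution has the form (count + α) / (total + α·V), where V is the number of possible outcomes in that distribution, counting the UNK entry. Below is a minimal sketch of the log-probability computation; the function name and argument names are illustrative only, not part of the starter code.

from math import log

ALPHA = 0.1

def add_alpha_log_prob(count, context_total, num_outcomes):
    # Sketch: smoothed log probability of one outcome given its context, where
    # num_outcomes is the number of possible outcomes in the distribution
    # (including the reserved UNK entry).
    return log((count + ALPHA) / (context_total + ALPHA * num_outcomes))

# Calling this with count=0 gives the value to store under the UNK key.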

[9 pts]

3. Computing HMM joint probability of a sentence and tags

Implement joint_prob() to calculate the joint log probability of the provided sentence's words and tags according to the learned transition and emission parameters. This will be called for both gold and predicted taggings of each test sentence.
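
Conceptually, this is a sum of one transition log probability per tag bigram (including the transitions from START and to END) plus one emission log probability per word. The sketch below assumes the parameters are stored as nested dictionaries transition_logprobs[prevtag][tag] and emission_logprobs[tag][word] with UNK fallbacks; these names are illustrative, and your data structures from part 2 may be organized differently.

def joint_prob_sketch(sentence, transition_logprobs, emission_logprobs):
    # Sketch: sentence is a list of (word, tag) pairs; returns the joint log probability.
    logp = 0.0
    prevtag = START
    for word, tag in sentence:
        logp += transition_logprobs[prevtag].get(tag, transition_logprobs[prevtag][UNK])
        logp += emission_logprobs[tag].get(word, emission_logprobs[tag][UNK])
        prevtag = tag
    logp += transition_logprobs[prevtag].get(END, transition_logprobs[prevtag][UNK])
    return logp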

[9 pts]

4. Tagging with the HMM

To tag a sentence, you need to apply the Viterbi algorithm, and then retrace your steps back to the initial dummy item.

hmm_tag_sentence() is the method that orchestrates the tagging of a sentence using the Viterbi algorithm and returns the tagged words. viterbi(), which has already been implemented for you, creates and completes the Viterbi chart.

We won't have you implement the whole Viterbi algorithm this semester, but it would benefit you to understand the implementation of viterbi(). This function uses dynamic programming to build a chart of possible tags, where the score for tagging a given token depends on the possible tags of the previous token. For a detailed description of the Viterbi algorithm, see Section 8.4.5 of the textbook (SLP3).

viterbi() uses two helper methods which you need to implement: find_best_item(), which computes the value and backpointer for a single chart cell; and retrace(), which follows backpointers in the completed chart to reconstruct the best tagging of the sentence. Implement these methods according to the comments.

Hint: When you determine the possible tags for a word, for efficiency reasons you should consider only the tags that have been observed in the training corpus for that word. For a word that was unseen in training, consider all tags.

Clarification: What the pseudocode refers to as an "item" should be implemented as a tuple consisting of 1) the tag at that position; 2) a reference to a preceding item; 3) the log probability of the sequence. The item thus represents the chart cell, its backpointer, and its value.
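
To illustrate how such items flow through the chart, here is a rough sketch of the two helpers, again assuming the illustrative nested-dictionary parameters named above. It is meant only to clarify the item structure; the comments in the starter code take precedence.

def find_best_item_sketch(word, tag, possible_predecessors,
                          transition_logprobs, emission_logprobs):
    # Sketch: for the chart cell (word, tag), choose the predecessor item that maximizes
    # predecessor score + transition log prob + emission log prob, and return the new
    # item tuple (tag, best predecessor item, log probability of the best sequence).
    best = None
    for prev in possible_predecessors:
        prevtag, _, prev_logp = prev
        score = (prev_logp
                 + transition_logprobs[prevtag].get(tag, transition_logprobs[prevtag][UNK])
                 + emission_logprobs[tag].get(word, emission_logprobs[tag][UNK]))
        if best is None or score > best[2]:
            best = (tag, prev, score)
    return best

def retrace_sketch(final_item):
    # Sketch: follow backpointers from the final (END) item back toward the initial
    # dummy START item, collecting tags in reverse order.
    tags = []
    item = final_item[1]               # skip the END dummy item itself
    while item is not None and item[0] != START:
        tags.append(item[0])
        item = item[1]
    return list(reversed(tags))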

[15 pts]

5. Evaluating predictions

Finally, complete the implementation of count_correct() to compare a sentence's gold tagging with a prediction and return quantities that will be used to compute accuracy.
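
As a rough illustration of the kind of comparison involved (the starter code's comments specify exactly which quantities to return; for instance, it may also count unseen-word tokens separately):

def count_correct_sketch(gold_sentence, pred_sentence):
    # Sketch: count positions where the predicted tag matches the gold tag.
    correct = sum(1 for (_, gtag), (_, ptag) in zip(gold_sentence, pred_sentence)
                  if gtag == ptag)
    return correct, len(gold_sentence)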

[3 pts]

6. Running the code

Below count_correct(), the starter code provides the paths to the training and test sets (for Universal tags; you will have to change this for experiments with Penn tags). Then there is code that loads the data, calls the functions you have implemented above, and prints each test sentence with gold and predicted taggings, followed by summary statistics. The code can be run as

$ python3 tagger.py baseline

which evaluates the baseline tagger, or

$ python3 tagger.py hmm

which evaluates the HMM tagger.

7. Understanding the output

Each test sentence will be printed twice: first with gold tags, then with predicted tags. Each line begins with the HMM model's joint log probability of that tagging. The line with the predicted tags ends with the accuracy for that sentence.

For example, in the reference implementation, the following is printed for the last test sentence when running the baseline tagger:

-152.54779748745457 He/PRON listens/VERB and/CCONJ is/AUX excellent/ADJ in/SCONJ diagnosing/VERB ,/PUNCT addressing/VERB and/CCONJ explaining/VERB the/DET specific/ADJ issues/NOUN and/CCONJ suggesting/VERB exercises/NOUN to/PART use/VERB ./PUNCT
-146.95347972656901 He/PRON listens/NOUN and/CCONJ is/AUX excellent/ADJ in/ADP diagnosing/NOUN ,/PUNCT addressing/NOUN and/CCONJ explaining/VERB the/DET specific/ADJ issues/NOUN and/CCONJ suggesting/VERB exercises/NOUN to/PART use/VERB ./PUNCT 80%

When running the HMM tagger, the result is:

-152.54779748745457 He/PRON listens/VERB and/CCONJ is/AUX excellent/ADJ in/SCONJ diagnosing/VERB ,/PUNCT addressing/VERB and/CCONJ explaining/VERB the/DET specific/ADJ issues/NOUN and/CCONJ suggesting/VERB exercises/NOUN to/PART use/VERB ./PUNCT
-144.87988835641178 He/PRON listens/NOUN and/CCONJ is/AUX excellent/ADJ in/ADP diagnosing/PROPN ,/PUNCT addressing/PROPN and/CCONJ explaining/VERB the/DET specific/ADJ issues/NOUN and/CCONJ suggesting/VERB exercises/ADV to/PART use/VERB ./PUNCT 75%

If you are using a terminal which supports ANSI color-coding, incorrect predicted tags and their gold counterparts will be colored.

8. Questions

Answer these in writeup.txt and submit them in a zip file along with your code. [15 pts]

For the Universal tagset

  1. What accuracies do the baseline and HMM taggers achieve on the test set? Report accuracies for (i) all tokens, (ii) unseen (OOV) tokens, (iii) perfect tagging of full sentences. (Code is already there to display this information for you.) [2 pts]

  2. Which word types are most frequently tagged incorrectly by the HMM, and why? (You will need to write some additional code to answer this.) [5 pts]

  3. With the HMM tagger, for how many sentences does the gold tagging have higher probability than the Viterbi tagging? How is this possible if the Viterbi algorithm is supposed to return the most probable tagging under the model? Hint: Uncomment the assertion on line 272 so the program will terminate after printing the first sentence with this property. You may want to inspect some of the probabilities pertaining to this sentence. [2 pts]

Comparing tagsets

  1. Now run the baseline and HMM taggers on the Penn-tagged data (you will need to change the paths in TRAIN_DATA and TEST_DATA). What are the overall token accuracies? Are they higher or lower than with the Universal tagset? Why do you think this is the case? [2 pts]

  2. How much time does it take for training and predicting with each of the taggers, in both tagsets? (Code is already there to display this information for you.) Does training time or prediction time dominate, and why? [2 pts]

  3. If there were significantly more TRAINING data, would you expect a noticeable impact on prediction runtime? Why or why not? [2 pts]