Authors: Klinton Bicknell, Harry Eldridge, Nathan Schneider
Turning it in. Upload a .zip file of all your code, output files, and written responses to Canvas
Starter code: problem1.py, problem2.py, problem3.py, generate.py (with bugfix)
word_to_index dictionary
[Coding only: use starter code problem1.py]
The first step in building an n-gram model is to create a dictionary that maps words to indices (which we’ll use to access the elements corresponding to that word in a vector or matrix of counts or probabilities). You’ll create this dictionary from a vocabulary file that lists each word in the corpus exactly once: brown_vocab_100.txt
This file lists the 811 word types used in the first 100 sentences of the Brown corpus and includes the special word types <s> and </s>, for a total of 813 word types. In this version of the Brown corpus, punctuation marks are also treated as words, so they are included in this total. In the corpus itself, punctuation words are separated from other words by a space, just like other words, so you won’t need to treat them differently at all. The brown_vocab_100.txt file is formatted to contain one word per line. We want to create a dictionary in Python that maps each word to its order of occurrence in this file, starting with 0. For example, the first word in the file is ‘all’, which should map to 0. The last word is ‘resolution’, which should map to 812. To create the dictionary in Python, you’ll iterate over each line (which is a word plus a newline character), use rstrip() to remove the trailing newline character, and then add the word to the dictionary, mapped to its appropriate index. To do this, you’ll need to keep track of which line of the file you’re on using enumerate().
After creating this dictionary, write it to a file called word_to_index_100.txt. Because wf.write() only writes strings, you will need to convert the dictionary to a string before you can write it, using the str() function. To check that you’ve done this correctly, verify in the output file that ‘all’ is mapped to 0 and ‘resolution’ to 812.
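To make these steps concrete, here is a minimal sketch of what problem1.py might look like; variable names such as vocab_file, word_index_dict, and wf are illustrative and may differ from the starter code.

vocab_file = open("brown_vocab_100.txt", "r")

word_index_dict = {}
for i, line in enumerate(vocab_file):
    word = line.rstrip()            # strip the trailing newline character
    word_index_dict[word] = i       # map the word to its 0-based line index
vocab_file.close()

wf = open("word_to_index_100.txt", "w")
wf.write(str(word_index_dict))      # wf.write() only accepts strings
wf.close()

print(word_index_dict["all"])          # should print 0
print(word_index_dict["resolution"])   # should print 812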
[Coding and written answer: use starter code problem2.py]
Now you’ll build a simple MLE unigram model from the first 100 sentences in the Brown corpus, found in: brown_100.txt
This corpus is represented as one sentence per line with a space separating all words. You’ll need to split each sentence into a list of words (e.g., using the string’s .split() member function), convert each word to lowercase (e.g., using the string’s .lower() member function), and add the end-of-sentence word </s>.
First, copy your code from problem 1 to load the dictionary mapping words to indices. Then import numpy as np and initialize the numpy vector of counts with zeros. Finally, iterate through the sentences and increment counts for each of the words they contain.
Now print the counts vector. (Do this in the Python interpreter; don’t add this printing to your script.) Q: Estimate (just by eyeballing) the proportion of the word types that occurred only once in this corpus. Do you think the proportion of words that occur only once would be higher or lower if we used a larger corpus (e.g., all 57,000 sentences in Brown)? Why or why not?
Finally, to normalize your counts vector and create probabilities, you can simply divide the counts vector by its sum in numpy like so:
probs = counts / np.sum(counts)
Write your new probs vector to a file called unigram_probs.txt and verify that the first probability in it (the probability of ‘all’) is 0.00040519 and that the last probability (the probability of ‘resolution’) is 0.00364668. (Note that our model trained on this small corpus has estimated that ‘resolution’ is about 10 times as frequent as ‘all’! Models trained on very small corpora are very noisy.)
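Putting these steps together, a minimal sketch of the counting and normalization in problem2.py might look like the following. It assumes word_index_dict from problem 1 is already loaded; file handling and names are illustrative.

import numpy as np

counts = np.zeros(len(word_index_dict))      # one count per word type

with open("brown_100.txt", "r") as f:
    for line in f:
        words = line.lower().split()
        # append '</s>' here if your corpus lines don't already include it
        for word in words:
            counts[word_index_dict[word]] += 1

probs = counts / np.sum(counts)              # normalize counts into probabilities

wf = open("unigram_probs.txt", "w")
wf.write(str(probs))
wf.close()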
[Coding only: use starter code problem3.py]
Now, you’ll create an MLE bigram model, in much the same way as you created an MLE unigram model. I recommend writing the code again from scratch, however (except for the code initializing the mapping dictionary), so that you can test things as you go. The main differences between coding an MLE bigram model and a unigram model are:
First, because bigram counts depend on the previous word as well as the current word, you’ll store the counts in a matrix (rows indexed by the previous word, columns by the current word) rather than in a vector, and you’ll need to keep track of the previous word as you iterate. You can do this by (1) setting
previous_word = '<s>'
just prior to the for loop iterating through a line (skipping the '<s>' at the 0th index), and then (2) inside the for loop, adding a line that sets
previous_word = word
after incrementing the counts for the current previous_word, word pair. Think this through to make sure you understand why this ensures that the variable previous_word will always be the appropriate previous word when incrementing counts.
Second, you’ll now normalize each row of the counts matrix instead of just normalizing a single counts vector. To do that, replace the line involving np.sum() with the lines
from sklearn.preprocessing import normalize
probs = normalize(counts, norm='l1', axis=1)
Make sure the matrix is arranged so that each row gives the probabilities of words given the preceding word, so that it works with the generator. For more on this, see generate.py.
When complete, add code to write the following probabilities to bigram_probs.txt, one per line:
Make sure that the first two of these probabilities are 1.0 and about 0.08333.
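As a reference point, a minimal sketch of the counting and row normalization described above might look like this in problem3.py (again assuming word_index_dict is already loaded; names are illustrative):

import numpy as np
from sklearn.preprocessing import normalize

V = len(word_index_dict)
counts = np.zeros((V, V))          # rows: previous word, columns: current word

with open("brown_100.txt", "r") as f:
    for line in f:
        words = line.lower().split()
        previous_word = '<s>'
        for word in words[1:]:     # skip the '<s>' at the 0th index
            counts[word_index_dict[previous_word], word_index_dict[word]] += 1
            previous_word = word

probs = normalize(counts, norm='l1', axis=1)   # each row now sums to 1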
[Coding and written answer: save code as problem4.py]
This time, copy problem3.py to problem4.py. We’ll just be making a very small modification to the program to add smoothing. In class, we discussed two types of smoothing in detail: Laplace smoothing, in which 1 pseudocount is added to every bigram count, and add-α smoothing, in which some amount α is added to every bigram count. (Laplace smoothing is thus a special case of add-α smoothing.) You should modify your program to use add-α smoothing with α = 0.1, i.e., pretending that we saw an extra one-tenth of an instance of each bigram. With numpy, you can very simply add 0.1 to every cell of a matrix like so:
counts += 0.1
Now also change the program to write the same four probabilities as before to a file called smooth_probs.txt. To verify that your program is working correctly, check that the first probability in the file is about 0.0133657.
Finally, compare smooth_probs.txt to bigram_probs.txt. Q: Why did all four probabilities go down in the smoothed model? Now note that the probabilities did not all decrease by the same amount. In particular, the two probabilities conditioned on ‘the’ dropped only slightly, while the other two probabilities (conditioned on ‘all’ and ‘anonymous’) dropped rather dramatically. Q: Why did add-α smoothing cause probabilities conditioned on ‘the’ to fall much less than these others? And why is this behavior (causing probabilities conditioned on ‘the’ to fall less than the others) a good thing? In figuring this out, you may find it useful to look at the relevant individual rows of the counts matrix (prior to adding the 0.1) to see how they’re different. In numpy, you can look at the nth row of the counts matrix using counts[n,].
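For example, in the interpreter you might compare the relevant rows like this (a sketch assuming the unsmoothed counts matrix and the word_index_dict mapping; names are illustrative):

print(counts[word_index_dict['the'], ])         # counts of bigrams starting with 'the'
print(counts[word_index_dict['all'], ])         # counts of bigrams starting with 'all'
print(counts[word_index_dict['anonymous'], ])   # counts of bigrams starting with 'anonymous'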
[Coding only: save code as problem5.py]
Using your knowledge of language models, compute what the following probabilities would be in both a smoothed and an unsmoothed trigram model (note: you should not be building an entire model, just what you need to calculate these probabilities; a small counting sketch follows the list below):
p(time | in, the)
p(said | the, jury)
p(recommended | the, jury)
p(that | jury, said)
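As an illustration, here is a hedged sketch of how you might gather just the counts needed for the first probability, p(time | in, the). It assumes add-α smoothing with α = 0.1 and takes V to be the vocabulary size; the file handling mirrors the earlier problems and all names are illustrative.

tri_count = 0    # occurrences of the trigram "in the time"
ctx_count = 0    # occurrences of the context "in the" followed by any word
V = len(word_index_dict)

with open("brown_100.txt", "r") as f:
    for line in f:
        words = line.lower().split()
        for i in range(len(words) - 2):
            if words[i] == "in" and words[i + 1] == "the":
                ctx_count += 1
                if words[i + 2] == "time":
                    tri_count += 1

p_mle = tri_count / float(ctx_count)                   # unsmoothed estimate
p_smooth = (tri_count + 0.1) / (ctx_count + 0.1 * V)   # add-0.1 smoothed estimate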
[Coding and written answer]
For this problem, you will use each of the three models you’ve constructed in problems 2–4 to evaluate the probability of a toy corpus of sentences (containing just the first two sentences of the Brown corpus) found at: toy_corpus.txt
The sentences are contained in this corpus in the same format as the other corpora you’ve been using so far: one sentence per line and words separated by a space. As before, you’ll need to remove the end-of-line character and add the end-of-sentence word.
First, you’ll edit problem2.py and add code at the bottom of the script to iterate through each sentence in the toy corpus and calculate the joint probability of all the words in the sentence under the unigram model. Then write the probability of each sentence to a file unigram_eval.txt, formatted to have one probability on each line of the output file. To do this, you’ll be writing a loop within a loop, very similarly to how you iterated through the training corpus to estimate the model parameters. But instead of incrementing counts after each word, you’ll be updating the joint probability of the sentence (multiplying the probabilities of each word together). One easy way to do this is to initialize sentprob = 1 prior to looping through the words in the sentence, and then just update sentprob *= wordprob with the probability of each word. At the end of the loop, sentprob will contain the total joint probability of the whole sentence. To verify that this worked correctly, note that the joint probability of the second sentence under this model should be 4.84008387782e-99.
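A minimal sketch of this evaluation loop (appended to problem2.py) might look like the following; probs and word_index_dict come from earlier in the script, and the file handling is illustrative.

toy = open("toy_corpus.txt", "r")
wf = open("unigram_eval.txt", "w")
for line in toy:
    words = line.lower().split()
    # append '</s>' here if the corpus lines don't already include it
    sentprob = 1.0
    for word in words:
        wordprob = probs[word_index_dict[word]]
        sentprob *= wordprob              # multiply in each word's probability
    wf.write(str(sentprob) + "\n")        # one joint probability per line
toy.close()
wf.close()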
Next, you’ll transform this joint probability into a perplexity (of each sentence), and write that to the file instead. To calculate the perplexity, first calculate the length of the sentence in words (be sure to include the end-of-sentence word) and store it in a variable sent_len, and then you can calculate
perplexity = 1/(pow(sentprob, 1.0/sent_len))
which reproduces the definition of perplexity we discussed in class. Now, write the perplexity of each sentence to the output file instead of the joint probabilities. To verify that you’ve done this correctly, note that the perplexity of the second sentence with this model should be about 153.
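Concretely, the change inside the sentence loop might look like this (a sketch using the variables from the snippet above):

sent_len = len(words)                            # include the </s> word in the length
perplexity = 1 / (pow(sentprob, 1.0 / sent_len))
wf.write(str(perplexity) + "\n")                 # write perplexity instead of sentprob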
Now, you’ll do the same thing for your other two models. Add code to problem3.py to calculate the perplexities of each sentence in the toy corpus and write them to a file bigram_eval.txt. Similarly, add code to problem4.py and write the perplexities under the smoothed model to smoothed_eval.txt. To verify that you did these correctly, note that the perplexity of the second sentence should be about 7.237 with the MLE bigram model and about 49.63 with the smoothed bigram model.
Compare the perplexities of these two sentences under all three models. Q: Which model performed worst and why might you have expected that model to have performed worst? Q: Did smoothing help or hurt the model’s ‘performance’ when evaluated on this corpus? Why might that be?
[Written answer]
Edit problem2.py, problem3.py, and problem4.py to use the provided generator code (generate.py) to generate some (~10) sentences from each of your models.
Store the sentences in unigram_generation.txt, bigram_generation.txt, and smoothed_generation.txt to go with their respective models.
Q: Compare the models qualitatively based on their generation. How does each perform?