A4: LSTMs

Author: Austin Blodgett

Starter code and data as a zip file

For this homework, you will be relying on the Keras library to implement Long Short Term Memory (LSTM) neural networks to solve three problems: text classification, POS tagging, and language modeling.

It's recommended that you install Anaconda first, and then run the following lines in the conda terminal. This installs TensorFlow, Keras (included in recent versions of TensorFlow), and the Hugging Face Transformers library, which provides pretrained BERT models.

pip install --upgrade pip
pip install tensorflow
pip install transformers

As always, do not look at the test data until you are finished building and tuning your models. Use only the train and dev data until then.

1. LSTM Classifier

Finish the code in surname-classifier-lstm.py, which will read surnames, model characters with an LSTM, and predict the language origin of the surname. You will implement an embedding layer, a bidirectional LSTM layer, and a dense layer with a softmax activation for the output layer. To connect an LSTM layer to a Dense layer, you have to set the parameter return_sequences to False. To connect an LSTM layer to another LSTM layer (if you want to use multiple layers), you have to set the parameter return_sequences to True. All of these layer architectures are implemented in Keras. What you need to do is assemble these layers, and then train and evaluate a model. Keras makes assembling a network with multiple different layers easy with its Sequential wrapper. See the example here.
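
As a rough sketch (not the required solution), a classifier along these lines can be assembled with the Sequential wrapper. The vocabulary size, embedding size, hidden size, and number of classes below are placeholder values; substitute the ones computed by the skeleton code.

import tensorflow as tf
from tensorflow.keras import layers, models

# Placeholder hyperparameters -- replace with values derived from the data.
vocab_size, embed_dim, hidden_dim, num_classes = 60, 32, 64, 18

model = models.Sequential([
    # Maps each character index to a dense vector; mask_zero treats index 0 as padding.
    layers.Embedding(vocab_size, embed_dim, mask_zero=True),
    # return_sequences=False: only the final hidden state is passed on to the Dense layer.
    layers.Bidirectional(layers.LSTM(hidden_dim, return_sequences=False)),
    # One probability per language via softmax.
    layers.Dense(num_classes, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",  # labels are one-hot encoded by the skeleton code
              metrics=["accuracy"])
model.summary()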

Your code should support a batch size of one or greater. Be aware of the shape and format of your input as you are coding your model. The skeleton code given to you will convert tokens to indices, pad sequences to be the same length in a single batch, and 1-hot encode labels for prediction. Try to understand the dimensionality of inputs and outputs and what they represent.

A matrix's shape has two values: the number of rows and the number of columns. In neural networks we work with tensors, which can have more complex shapes. A dimension might represent the batch size, the length of a sequence, the size of an embedding, and so on.
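
For concreteness, a single padded batch for the classifier might look like the following; all of the sizes here are made up for illustration.

import numpy as np

batch_size, max_len, num_classes = 32, 15, 18   # hypothetical sizes
X_batch = np.zeros((batch_size, max_len), dtype="int32")        # padded character indices
y_batch = np.zeros((batch_size, num_classes), dtype="float32")  # one-hot language labels

print(X_batch.shape)  # (32, 15): one row of character indices per surname
print(y_batch.shape)  # (32, 18): one one-hot label vector per surname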

Data: surname-data/surnames.csv

Q: What are the advantages/disadvantages of using a batch size of 1 vs. doing a full pass over the data in each iteration?

Q: What are the dimensionalities of the input and the output of the model (including the batch size) and why?

[Code 9 pts, Written 6 pts]

2. LSTM POS Tagger

For this problem you will build a POS tagger to run on the same data as A3. Copy your model from the previous problem and add it to pos-lstm.py. To change the LSTM classifier into an LSTM sequence tagger, you will need to change the shape of the predicted output from a single class to a sequence of labels. To do this, you will use the TimeDistributed wrapper on the Dense output layer so that an output gets predicted for each token in the sequence (you also need to set the parameter return_sequences to True in the LSTM layer). Add dropout after each layer (hint: there are two ways to do this).
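
A minimal sketch of this change, again with placeholder sizes, is shown below. It illustrates both ways of adding dropout: a standalone Dropout layer and the dropout/recurrent_dropout arguments of the LSTM itself.

import tensorflow as tf
from tensorflow.keras import layers, models

# Placeholder hyperparameters -- replace with your own values.
vocab_size, embed_dim, hidden_dim, num_tags = 10000, 100, 128, 17

model = models.Sequential([
    layers.Embedding(vocab_size, embed_dim, mask_zero=True),
    layers.Dropout(0.3),  # dropout as a separate layer
    # return_sequences=True: keep one hidden state per token for the tagger.
    layers.Bidirectional(layers.LSTM(hidden_dim, return_sequences=True,
                                     dropout=0.3, recurrent_dropout=0.3)),  # dropout via LSTM arguments
    # TimeDistributed applies the same Dense layer independently at every time step.
    layers.TimeDistributed(layers.Dense(num_tags, activation="softmax")),
])

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])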

Data: pos-data/en-ud-train.upos.tsv (Train), pos-data/en-ud-dev.upos.tsv (Dev), pos-data/en-ud-test.upos.tsv (Test)

Q: What are the dimensionalities of the input and the output and how do they differ from the previous model? Why?

Q: What is the point of adding dropout to your model? What are some advantages/disadvantages of using a high dropout value?

[Code 9 pts, Written 6 pts]

3. LSTM Language Model

For this task, you will train an LSTM language model that predicts the next word at each position in a sequence. You will implement the model with GloVe word embeddings and with BERT contextualized embeddings and compare the two versions. Copy your model from the previous problem and add it to language-model-lstm.py. To train the model, for each token you need to predict the next token in the sequence. To do this, you will shift each sequence by one to create target labels. Implement the function shift_by_one() to take a sequence and return a copy shifted by one position (ending with a padding token). You will also need to swap out the embedding layer for pretrained embeddings. Instructions are below.
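
One way shift_by_one() could be written, assuming sequences are Python lists of token indices and that index 0 is the padding token (check the skeleton code for the actual padding convention):

PAD_INDEX = 0  # assumption: index 0 is the padding token in your vocabulary

def shift_by_one(seq):
    """Return the sequence shifted left by one position, padded at the end.

    Example: [5, 8, 2, 9] -> [8, 2, 9, 0]
    The target at position i is the input token at position i + 1.
    """
    return seq[1:] + [PAD_INDEX]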

Add GloVe embeddings to your model. Use this download link to get glove.6B pretrained embeddings. DO NOT upload them when you submit your homework. The function load_pretrained_embeddings() looks into a directory called glove.6B and returns a matrix of weights. You can finish implementing it and use it to initialize the weight matrix of your embedding layer (See example code here).
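
Here is a sketch of how load_pretrained_embeddings() might build the weight matrix; the file name, the word2idx mapping, and the dimensions are assumptions, so adapt them to the skeleton code.

import numpy as np

def load_pretrained_embeddings(word2idx, glove_dir="glove.6B", dim=100):
    # Rows for padding/unknown words stay as small random vectors.
    matrix = np.random.uniform(-0.05, 0.05, size=(len(word2idx), dim))
    with open(f"{glove_dir}/glove.6B.{dim}d.txt", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, vector = parts[0], np.asarray(parts[1:], dtype="float32")
            if word in word2idx:
                matrix[word2idx[word]] = vector
    return matrix

# Hypothetical usage: pass the matrix as the initial Embedding weights (optionally frozen).
# embedding = tf.keras.layers.Embedding(len(word2idx), 100,
#                                       weights=[load_pretrained_embeddings(word2idx)],
#                                       trainable=False)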

To use BERT contextualized embeddings, you can use BERT_Wrapper, a class defined in language-model-lstm.py. It instantiates a pretrained BERT model with fixed weights to use as your initial layer. Unlike GloVe embeddings, BERT embeddings are different for each sentence, so you need to replace the entire embedding layer with a BERT_Wrapper, not just its weights. When running your model with BERT, make sure to use the command line argument --use-bert. Keep in mind for your discussion that BERT uses WordPiece tokenization, so both the tokenization and the vocabulary will differ from the GloVe model.
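
BERT_Wrapper is already provided in the starter code, but conceptually it does something like the sketch below; the class name and details here are illustrative, not the actual starter implementation.

import tensorflow as tf
from transformers import TFBertModel

class FrozenBertEmbedding(tf.keras.layers.Layer):
    # Illustrative stand-in for the provided BERT_Wrapper: frozen BERT used as an embedding layer.

    def __init__(self, model_name="bert-base-uncased", **kwargs):
        super().__init__(**kwargs)
        self.bert = TFBertModel.from_pretrained(model_name)
        self.bert.trainable = False  # fixed weights: BERT is only a feature extractor here

    def call(self, input_ids):
        # One contextualized vector per wordpiece: shape (batch_size, seq_len, 768).
        return self.bert(input_ids).last_hidden_state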

Data: lm-data/little-prince-train.txt (Train), lm-data/little-prince-dev.txt (Dev), lm-data/little-prince-test.txt (Test)

Q. Describe the differences in training time, number of epochs needed to converge, and final perplexity between the GloVe and BERT versions of your model. Why do you think these differences occur?

Q. Run generate_text() a few times and comment on how good the generated sequences are for both the GloVe and BERT versions of the model.

Important Note: Do not use Bidirectional in your language model. BiLSTMs process a sequence from both left-to-right and right-to-left, but the right-to-left direction will not be available to your model at generation time. This can lead to your model failing at prediction.

[Code 9 pts, Written 6 pts]

4. Tuning

After you have your models working, experiment with different hyperparameter settings for each model to improve accuracy on the dev set (try and report at least two settings per model). Possible options include adjusting the embedding size or the hidden layer size, adding layers to your model, adjusting the number of epochs, adding early stopping, or normalizing the input.
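
For example, early stopping can be added with a standard Keras callback; the patience value here is arbitrary.

import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # stop when dev loss stops improving
    patience=3,                  # arbitrary: wait 3 epochs without improvement before stopping
    restore_best_weights=True,   # roll back to the best dev-loss epoch
)

# Hypothetical call; substitute your own training and dev data.
# model.fit(X_train, y_train, validation_data=(X_dev, y_dev),
#           epochs=50, batch_size=32, callbacks=[early_stop])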

Q: Report the parameters you tried and their accuracy on the dev set in a table. When you are done experimenting, evaluate your model with the best parameters on the test set and report the results.

[15 pts]

Notes