Author: Austin Blodgett
Starter code and data as a zip file
For this homework, you will be relying on the Keras library to implement Long Short-Term Memory (LSTM) neural networks to solve three problems: text classification, POS tagging, and language modeling.
It's recommended that you install Anaconda first, and install Keras with the following two lines in the conda terminal:
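For example (assuming the TensorFlow backend; exact package names may vary with your Anaconda and Keras versions):

    conda install tensorflow
    conda install keras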
Finish the code in surname-classifier-lstm.py, which will read surnames, model characters with an LSTM, and predict the language origin of the surname. You will implement an embedding layer, a bidirectional LSTM layer, and a dense layer with a softmax activation for the output layer. To connect an LSTM layer to a Dense layer, you have to set the parameter return_sequences to False. To connect an LSTM layer to another LSTM layer (if you want to use multiple layers), you have to set the parameter return_sequences to True. All of these layer architectures are implemented in Keras; what you need to do is assemble these layers, then train and evaluate a model. Keras makes assembling a network from multiple different layers easy with its Sequential wrapper. See the example here.
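To make the wiring concrete, here is a minimal sketch of such a model; the layer sizes and names (VOCAB_SIZE, N_CLASSES, etc.) are illustrative placeholders, not values from the starter code:

    # Minimal sketch of the surname classifier; sizes are assumptions, not from the assignment.
    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, Bidirectional, Dense

    VOCAB_SIZE = 60    # number of distinct characters (assumption)
    EMBED_DIM = 32     # embedding dimension (assumption)
    HIDDEN_DIM = 64    # LSTM hidden size (assumption)
    N_CLASSES = 18     # number of language-origin labels (assumption)

    model = Sequential()
    model.add(Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True))
    # return_sequences=False: only the final state is passed on to the Dense layer
    model.add(Bidirectional(LSTM(HIDDEN_DIM, return_sequences=False)))
    model.add(Dense(N_CLASSES, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])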
Your code should support a batch size of one or greater. Be aware of the shape and format of your input as you are coding your model. The skeleton code given to you will convert tokens to indices, pad sequences to be the same length in a single batch, and 1-hot encode labels for prediction. Try to understand the dimensionality of inputs and outputs and what they represent.
A matrix's shape has two values: the number of rows and the number of columns. In neural networks we use tensors, which can have more complex shapes. A dimension might represent the length of a sequence, the number of sequences in a batch, etc.
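As a concrete illustration of these shapes (the numbers here are made up, not taken from the assignment data):

    # A batch of two surnames, already converted to character indices.
    from keras.preprocessing.sequence import pad_sequences
    from keras.utils import to_categorical

    batch = [[4, 7, 2], [9, 1]]
    X = pad_sequences(batch, maxlen=5)           # shape (2, 5): (batch_size, max_length)
    y = to_categorical([0, 3], num_classes=18)   # shape (2, 18): one-hot labels
    print(X.shape, y.shape)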
Q: What are the advantages/disadvantages of using a batch size of 1 vs. doing a full pass over the data in each iteration?
Q: What are the dimensionalities of the input and the output of the model (including the batch size) and why?
For this problem you will build a POS tagger to run on the same data as A3. Copy your model from the previous problem and add it to pos-lstm.py. To turn the LSTM classifier into an LSTM tagger, you will need to change the shape of the predicted output from a single class to a sequence of labels. To do this, you will use the TimeDistributed wrapper on the Dense output layer so that an output gets predicted for each token in the sequence (you also need to set the parameter return_sequences to True in the LSTM layer). Add dropout after each layer (hint: there are two ways to do this).
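A sketch of the change follows; the sizes, tag count, and dropout rate are assumptions rather than values from the starter code, and both ways of adding dropout are indicated in comments:

    # Sketch of the per-token tagger; sizes and dropout rate are illustrative.
    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, Bidirectional, Dense, Dropout, TimeDistributed

    VOCAB_SIZE = 10000   # vocabulary size (assumption)
    EMBED_DIM = 100      # embedding dimension (assumption)
    HIDDEN_DIM = 128     # LSTM hidden size (assumption)
    N_TAGS = 45          # number of POS tags (assumption)

    model = Sequential()
    model.add(Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True))
    model.add(Dropout(0.5))                        # one way: a separate Dropout layer
    # return_sequences=True: the LSTM outputs a vector for every token
    model.add(Bidirectional(LSTM(HIDDEN_DIM, return_sequences=True,
                                 dropout=0.5)))    # the other way: the layer's dropout argument
    # TimeDistributed applies the softmax layer at every timestep, one prediction per token
    model.add(TimeDistributed(Dense(N_TAGS, activation='softmax')))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])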
Q: What are the dimensionalities of the input and the output and how do they differ from the previous model? Why?
For this task, you will train an LSTM that predicts the next word at each position. Copy your model from the previous problem and add it to language-model-lstm.py. To train the model, for each token you need to predict the next token in the sequence. To do this, you will shift each sequence by one to create target labels. Implement the function shift_by_one() to take a sequence and return one where each index is shifted by 1 (ending with a padding token). Add early stopping to your model.
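One possible implementation of shift_by_one(), assuming the padding index is 0 (the default used by pad_sequences), together with a Keras EarlyStopping callback; the patience value is illustrative:

    def shift_by_one(seq, pad_index=0):
        """Return seq shifted left by one, so position i holds the token that was
        at position i+1; the final position is filled with a padding token."""
        return list(seq[1:]) + [pad_index]

    # Early stopping via a Keras callback.
    from keras.callbacks import EarlyStopping
    early_stop = EarlyStopping(monitor='val_loss', patience=2)
    # model.fit(X, y, validation_data=(X_dev, y_dev), epochs=10, callbacks=[early_stop])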
Q: Run generate_text() a few times and comment on how good the generated sequences are.
After you have your models working, try different sets of parameters for each to try to improve accuracy on the dev set (try and report at least two sets of parameters). Possible options are adjusting the embedding size or the hidden layer size, adding layers to your model, adjusting the number of epochs, adding early stopping, or normalizing the input.
Q: Report the parameters you tried and their accuracy on the dev set in a table. When you are done experimenting, evaluate your model with the best parameters on the test set and report the results.
You may end up getting NaN loss while training. This problem comes up due to large weights or numerical imprecision. There are a few solutions: (1) you can add L2 regularizers to your model, (2) you can reduce the complexity of your model, or (3) you can change your optimizer.
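Sketches of the first and third fixes are below; the regularization strength, optimizer choice, and clipping value are illustrative, not tuned for this assignment:

    from keras.layers import Dense
    from keras.regularizers import l2
    from keras.optimizers import RMSprop

    # (1) L2 regularization on a layer's weights
    dense = Dense(64, activation='relu', kernel_regularizer=l2(0.01))

    # (3) a different optimizer, here with gradient clipping to keep updates from blowing up
    opt = RMSprop(clipnorm=1.0)
    # model.compile(loss='categorical_crossentropy', optimizer=opt)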
Training LSTMs takes much longer than writing the code. Training your LSTM models will usually take over 30 minutes per epoch depending on the complexity. Run a minimum of three epochs. We don't plan to be strict during grading on how long you trained your models or their final accuracy, but more epochs will usually improve your performance. More importantly, make sure that they run and that you understand the concepts involved.