Author: Nathan Schneider
The Native Language Identification (NLI) task is to predict, given an English document written by an ESL learner, what the author’s native language is. This can be done with better-than-chance accuracy because the native language leaves traces in the way one uses a second language. For this assignment you will complete a multiclass perceptron implementation for the NLI classification task.
ETS has released a dataset of TOEFL essays written by native speakers of 11 languages. For this assignment we’re giving you the essays from medium-proficiency students. The documents have already been tokenized and split into train, dev, and test sets (separate directories).
Download and unzip the assignment data. Your scripts should go in the directory that has train/, dev/, and test/ as subdirectories. Please do not rename any of the files or directories we have given you.
Complete the missing portions of the starter code for loading in the NLI corpus and training, predicting with, and evaluating the model: perceptron.py, evaluation.py.
A Python assert statement checks whether a condition that should be true is actually true, and if it isn't, raises an AssertionError. Keep the assert statements that are in the starter code: they are there to help catch some common bugs.
Coding convention: Whenever pairing a data point's gold label and prediction, I suggest using the abbreviations gold and pred, respectively, and always putting the gold label first.
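For instance, an accuracy helper following this convention might look like the sketch below (a hypothetical illustration, not part of the starter code):

```python
def accuracy(golds, preds):
    """Fraction of predictions matching the gold labels (gold always first)."""
    assert len(golds) == len(preds)
    return sum(1 for gold, pred in zip(golds, preds) if gold == pred) / len(golds)
```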
Written answers should go in writeup.pdf:
How many documents are there for each language in the training set? the dev set? (Make a table.) What would be the majority class baseline accuracy on the dev set?
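One way to compute the majority class baseline is sketched below, assuming the labels have been collected into lists (the function name and toy labels are illustrative, not from the starter code):

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, dev_labels):
    """Accuracy on dev of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for gold in dev_labels if gold == majority_label)
    return correct / len(dev_labels)

# Toy example (not real language counts): 'FRA' is the training majority class.
train = ['FRA', 'FRA', 'DEU', 'ITA']
dev = ['FRA', 'DEU', 'FRA', 'ITA']
print(majority_baseline_accuracy(train, dev))  # 0.5
```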
Implement the perceptron algorithm without averaging or early stopping. The maximum number of iterations to run is specified as a command-line parameter. (However, do stop before the specified number of iterations if the training data are completely separated, i.e., an iteration proceeds without any errors/updates.) As a baseline featureset, implement bias features and (binary) unigram features. Run training for up to 30 iterations, tracking train and dev set accuracy after each iteration. I suggest printing to stderr. With my implementation, the first 5 lines printed are:
(You don't necessarily need to reproduce this output exactly; it is just for illustration.)
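The training loop described above can be sketched roughly as follows. This is one possible shape, assuming documents are represented as feature dicts; your starter code's data structures may differ:

```python
from collections import defaultdict

def train_perceptron(train_data, labels, max_iters=30):
    """Multiclass perceptron without averaging.
    train_data: list of (gold, feats) pairs, where feats maps feature names
    to values (1 for binary unigram features, plus a bias feature).
    Returns weights: one weight dict per label."""
    weights = {y: defaultdict(float) for y in labels}
    for it in range(max_iters):
        errors = 0
        for gold, feats in train_data:
            # Predict the label whose weight vector scores the document highest.
            pred = max(labels,
                       key=lambda y: sum(weights[y][f] * v for f, v in feats.items()))
            if pred != gold:
                errors += 1
                # Standard perceptron update: reward gold, penalize pred.
                for f, v in feats.items():
                    weights[gold][f] += v
                    weights[pred][f] -= v
        if errors == 0:  # training data completely separated: stop early
            break
    return weights
```

Note the stopping condition: an iteration with zero errors means no update would ever fire again, so further iterations are pointless.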
How many iterations are required to separate the training data? Which number of iterations is likely to represent the best tradeoff between fitting the data and not overfitting?
Feature engineering: Experiment with at least 3 different kinds of additional features, which can include features derived from NLTK's WordNetLemmatizer or anything else you like that doesn't require external tools. Clearly document each kind of feature with a comment in the code. Try different combinations to see how high you can get the dev set accuracy. Keep in mind that different feature combinations may require different numbers of iterations for optimal dev set performance.
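As one illustration of how an additional feature kind might be added alongside the baseline features, here is a sketch of a feature extractor with optional character trigrams (the feature-name prefixes and the boundary-marking scheme are my own choices, not a requirement):

```python
def extract_features(tokens, char_ngrams=False, n=3):
    """Binary feature dict for one document: bias + unigram presence,
    optionally character n-grams within each token (one example of an
    additional feature kind)."""
    feats = {'bias': 1}
    for tok in tokens:
        feats['uni=' + tok] = 1           # binary: presence, not count
        if char_ngrams:
            padded = '<' + tok + '>'      # mark word boundaries
            for i in range(len(padded) - n + 1):
                feats['char=' + padded[i:i + n]] = 1
    return feats
```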
Then, on the test set, evaluate your full model as well as ablating each kind of new feature individually. (E.g.: all features, all except character n-grams, all except lemmas, etc.) In your writeup, create a table where each row is a different featureset: the baseline featureset, the full featureset, and the ablations. In the table, report accuracies as well as the value of I (number of iterations for the featureset, tuned on dev data).
Error analysis: For the test set results with your most accurate model, output the following data for analysis (you can modify evaluation.py if you wish):
a) A confusion matrix between languages.
b) For each language, the 10 highest-weighted and 10 lowest-weighted features (with their weights), as well as the bias feature weight for each language.
c) Precision, Recall, and F1 for each language.
Include these in your writeup in some human-readable format. What are some of the patterns you observe?
Hint: A quick way to prepare the analysis is to print data in TSV format to a file and then open the file in a spreadsheet program. Excel and Google Sheets both make it easy to color-code numeric data on a scale, which can highlight trends and exceptional values.
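A possible sketch of the per-language metrics in (c), built from the same (gold, pred) pairs that would feed the confusion matrix in (a) (the function name is hypothetical; you may already have equivalents in evaluation.py):

```python
from collections import Counter

def per_language_prf(golds, preds):
    """Precision, recall, and F1 for each label, from parallel gold/pred lists."""
    confusion = Counter(zip(golds, preds))  # (gold, pred) -> count
    labels = sorted(set(golds) | set(preds))
    scores = {}
    for lang in labels:
        tp = confusion[(lang, lang)]
        predicted = sum(c for (g, p), c in confusion.items() if p == lang)
        actual = sum(c for (g, p), c in confusion.items() if g == lang)
        prec = tp / predicted if predicted else 0.0
        rec = tp / actual if actual else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[lang] = (prec, rec, f1)
    return scores
```

The `confusion` Counter doubles as the confusion matrix: printing its entries as gold/pred/count rows in TSV gives exactly the spreadsheet-friendly format suggested in the hint.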
Before submitting, please ensure that your code runs on the cs-class server with Python 3.
Package all your .py and .pdf files together with the NLI data into a file called a4.zip and submit it via Canvas. (Include ALL files, including the original data files we have given you. Make sure the PDF file answers all the writeup questions.) If for some reason you are having trouble submitting with Canvas, email the instructor and TA with the zip file as an attachment.