A1: Classifiers for Native Language Identification

The Native Language Identification (NLI) task is to predict, given an English document written by an ESL learner, what the author’s native language is. This can be done with better-than-chance accuracy because the native language leaves traces in the way one uses a second language.

ETS has released a dataset of TOEFL essays written by native speakers of 11 languages. For this assignment we’re giving you the essays from medium-proficiency students. The documents have already been tokenized and split into train, dev, and test sets (separate directories).

Download and unzip the assignment data. Your scripts should go in the directory that has train/, dev/, and test/ as subdirectories. Please do not rename any of the files or directories we have given you.

Part I. Naïve Bayes

Download nbmodel.py and evaluation.py. Complete nbmodel.py so it implements a naïve Bayes model with words as the only features. Also complete the functions to load in the NLI corpus.

A Python assert statement checks whether a condition that should be true is actually true, and if it isn’t, raises an AssertionError. Keep the assert statements that are in the starter code: they are there to help catch some common bugs.
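
For example, an assert like the following (a hypothetical check, not one taken from the starter code) fails loudly if a corpus-loading function returns mismatched lists of documents and labels:

    docs, labels = ["doc one", "doc two"], ["lang_a"]
    assert len(docs) == len(labels), f"{len(docs)} docs but {len(labels)} labels"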

Your conditional word probability estimates should use Add-α smoothing. As described in lecture 6, Add-α (Lidstone) smoothing is a generalization of Add-1 smoothing that lets the amount of smoothing be tuned to the dataset. You should not smooth the prior distribution.
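
As a concrete sketch (the function and variable names here are made up; this is not the required structure of nbmodel.py), the Add-α estimate of P(w | c) from raw counts might look like:

    # counts[c][w]: frequency of word w in training documents of class c
    # vocab: set of all word types observed in training (shared across classes)
    def word_prob(counts, vocab, c, w, alpha):
        """Add-alpha smoothed estimate of P(w | c)."""
        numerator = counts[c].get(w, 0) + alpha
        denominator = sum(counts[c].values()) + alpha * len(vocab)
        return numerator / denominator

With α = 1 this reduces to Add-1 (Laplace) smoothing.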

There are actually two variants of naïve Bayes: one that treats each word as simply present or absent in a document, and one that models word frequencies (the multinomial model). The equations in the lecture slides are for the former, but you should implement the latter, which is what the textbook describes (SLP3 section 7.1).
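
To make the distinction concrete, here is a toy illustration of the two counting schemes (not code you need to hand in):

    from collections import Counter

    tokens = ["the", "cat", "saw", "the", "dog"]
    freq_counts = Counter(tokens)         # multinomial: "the" counts twice
    binary_counts = Counter(set(tokens))  # unique-words variant: each type counts once per document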

Written answers should go in a1.nb.pdf:

  1. How many documents are there for each language? (Make a table.) Add a line of code to display the learned prior distribution over classes. What is it? Check that it is consistent with the per-language document counts.
  2. How many documents are there for each language in the dev set? What would be the majority class baseline accuracy on the dev set?
  3. Tune the value of α by measuring performance on the dev set. Try α ∈ {0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0}. Create a table or plot of each α value and the corresponding dev accuracy. (Hint: If the accuracy is very close to 0 or very close to 1 with α=1.0, you have a bug.)
  4. Lemmatization is a way to counteract the sparsity introduced by morphological inflection. Use NLTK’s WordNetLemmatizer to lemmatize words in the corpus when the -l option is provided to your script (a minimal usage sketch appears after this list). Show results for tuning α on the dev set with lemmatization. Does lemmatization affect the best value of α, and why or why not? Comparing results from the best values of α, does lemmatization help or hurt?
  5. Final evaluation: Take the best model (as measured by the dev set) and measure its performance on test data. How do the two accuracies compare? (The starter code evaluates on the dev set, but the code you turn in should evaluate on the test set.)
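
For question 4, a minimal lemmatization sketch with NLTK is shown below. It assumes the WordNet data has been downloaded; where this hooks into your corpus loading (and how you parse the -l option) is up to you:

    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download("wordnet")              # one-time download of the WordNet data
    lemmatizer = WordNetLemmatizer()
    tokens = ["studies", "buses", "was"]
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]  # default POS is noun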

Part II: The Perceptron

For this part of the assignment you will complete a multiclass perceptron implementation for the same classification task as Part I. Starter code: perceptron.py

Coding convention: Whenever pairing a data point’s gold label and prediction, I suggest using the abbreviations gold and pred, respectively, and always putting the gold label first.

Written answers should go in a1.perceptron.pdf:

6. Implement the perceptron algorithm without averaging or early stopping; a minimal sketch of the multiclass update rule appears after this question. The maximum number of iterations to run is specified as a command line parameter. (However, do stop before the specified number of iterations if the training data are completely separated, i.e., an iteration proceeds without any errors/updates.) As a baseline featureset, implement bias features and (binary) unigram features. Run training for up to 30 iterations, tracking train and dev set accuracy after each iteration. I suggest printing to stderr. With my implementation, the first 5 lines printed are:

5366 training docs with 154.68747670518076 percepts on avg
598 dev docs with 154.81939799331104 percepts on avg
604 test docs with 153.6341059602649 percepts on avg
iteration: 0 updates=2980, trainAcc=0.4446515095042862, devAcc=0.560200668896321, params=110837
iteration: 1 updates=1176, trainAcc=0.7808423406634365, devAcc=0.540133779264214, params=127017

(You don't necessarily need to reproduce this output exactly; it is just for illustration.)

How many iterations are required to separate the training data? Which number of iterations is likely to represent the best tradeoff between fitting the data and not overfitting?
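
Here is the minimal sketch of the multiclass update rule mentioned in question 6. It assumes feature vectors are dicts and weights are one dict per class; the function names are made up, and it omits the iteration loop, stopping check, accuracy tracking, and I/O that the question asks for:

    def predict(weights, features):
        """Return the class whose weight vector gives the highest score."""
        return max(weights, key=lambda c: sum(weights[c].get(f, 0.0) * v
                                              for f, v in features.items()))

    def update(weights, features, gold, pred):
        """Multiclass perceptron update: promote the gold class, demote the prediction."""
        if gold != pred:
            for f, v in features.items():
                weights[gold][f] = weights[gold].get(f, 0.0) + v
                weights[pred][f] = weights[pred].get(f, 0.0) - v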

7. Feature engineering: Experiment with at least 3 different kinds of additional features. These can include, for example, character n-grams or lemma features, or anything else you like that doesn’t require external tools (one possible feature extractor is sketched below). Clearly document each kind of feature with a comment in the code. Try different combinations to see how high you can get the dev set accuracy. Keep in mind that different feature combinations may require different numbers of iterations.
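
As one illustration of an additional feature kind (character n-grams are only a suggestion; the function name and feature-naming scheme here are made up):

    def char_ngram_features(text, n=3):
        """Binary character n-gram features over the raw document text."""
        return {f"char{n}={text[i:i + n]}": 1 for i in range(len(text) - n + 1)}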

Then, on the test set, evaluate your full model as well as versions that ablate each kind of new feature individually. (E.g.: all features, all except character n-grams, all except lemmas, etc.) In your writeup, create a table where each row is a different featureset: the baseline featureset, the full featureset, and the ablations. In the table, report accuracies as well as the value of I (the number of iterations for that featureset, tuned on dev data).

8. Error analysis: For the test set results with your most accurate model, output the following data for analysis (you can modify evaluation.py if you wish):

a) A confusion matrix between languages.

b) For each language, the 10 highest-weighted and 10 lowest-weighted features (with their weights), as well as that language’s bias feature weight.

c) Precision, Recall, and F1 for each language.

Include these in your writeup in some human-readable format. What are some of the patterns you observe? Do the bias feature weights behave like priors in naïve Bayes—why or why not?

Hint: A quick way to prepare the analysis is to print data in TSV format to a file and then open the file in a spreadsheet program. Excel and Google Sheets both make it easy to color-code numeric data on a scale, which can highlight trends and exceptional values.
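
For instance, a sketch of dumping a confusion matrix as TSV (assuming gold_labels and pred_labels are parallel lists of language labels; adapt to however evaluation.py actually stores them):

    from collections import Counter

    def write_confusion_tsv(gold_labels, pred_labels, languages, path):
        """Rows are gold languages, columns are predicted languages, cells are counts."""
        confusion = Counter(zip(gold_labels, pred_labels))
        with open(path, "w") as out:
            out.write("\t" + "\t".join(languages) + "\n")
            for g in languages:
                out.write(g + "\t" + "\t".join(str(confusion[(g, p)]) for p in languages) + "\n")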

9. Extra credit: Implement weight averaging when an -a flag is provided. Recreate the feature ablation table from #7, but with averaging. What is the effect on accuracy? Runtime?
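
One simple (if naïve) way to implement averaging, sketched under the assumption that weights are stored as one dict per class: after every training example, add the current weights into a running total, and at the end divide the total by the number of examples seen. (A lazily-updated total is faster, but the idea is the same.)

    def accumulate(running_total, weights):
        """Add the current weights into a running total (one dict per class)."""
        for c, wvec in weights.items():
            bucket = running_total.setdefault(c, {})
            for f, v in wvec.items():
                bucket[f] = bucket.get(f, 0.0) + v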

Submission

Before submitting, please ensure that your code runs on the cs-class server. The A0 instructions describe how to log into the server and configure it for Python 3.

Package all your .py and .pdf files together into a file called a1.zip and submit it via Canvas. (Include ALL files, including the original data files we have given you. You may also include supplementary files, such as sample output or analysis files, as long as everything we have specifically asked for is in the two .pdf files. Make sure the two PDF files answer all the writeup questions.) The deadline is Monday 10/10 at 4:59pm, extended from the original Friday 10/7. (Note that it is mid-semester break, so there will be no class or office hours that day.) If for some reason you are having trouble submitting with Canvas, email the instructor and TA with the zip file as an attachment.