COSC-288: Machine Learning

Project 2
Spring 2021

Due: F 3/12 @ 5:00 P.M.
10 points

Building upon your implementation for p1, implement k-NN and naive Bayes for nominal attributes. Also implement routines for evaluating classifiers using k-fold cross-validation.

The implementations should be general in the sense that they should work for all data sets with nominal attributes and nominal class labels. Our convention is that the last attribute of the attribute declarations is the class label.


  1. We will design and implement a class hierarchy for classifiers for this and subsequent projects. Methods common to all classifiers include
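A base class along these lines might look as follows; the method names and signatures here (setOptions, train, classify) and the int-array representation of examples are assumptions for illustration, not the assignment's actual declarations. The trivial majority-class subclass shows the intended pattern of deriving learners from Classifier.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A minimal sketch of the classifier hierarchy; names are assumptions.
abstract class Classifier {
    // Subclasses override to process their command-line switches (e.g., -k).
    void setOptions(String[] options) { }

    // Build a model from training examples; each example is a row of
    // nominal-value indices, with the class label at classIndex.
    abstract void train(List<int[]> examples, int classIndex);

    // Predict the class-label index for one example.
    abstract int classify(int[] example);
}

// Example derivation: a baseline that always predicts the majority class.
class MajorityClassifier extends Classifier {
    private int prediction;

    @Override
    void train(List<int[]> examples, int classIndex) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int[] e : examples)
            counts.merge(e[classIndex], 1, Integer::sum);
        prediction = counts.entrySet().stream()
                .max(Map.Entry.comparingByValue()).get().getKey();
    }

    @Override
    int classify(int[] example) { return prediction; }
}
```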

  2. Implement k-NN as the class IBk. Derive IBk from Classifier. Your implementation does not have to use a heap, but the data structure you use to determine the k-closest neighbors must use O(k) space. The implementation of k-NN should process the command-line switch -k to let users specify the value of k. Use 3 as the default.

  3. Implement naive Bayes as the class NaiveBayes. Derive NaiveBayes from Classifier. It should use add-one smoothing for nominal attributes. Implement Estimator and derive CategoricalEstimator. The derived class must implement methods that add values to the estimator and get the probability of values.
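A minimal sketch of a categorical estimator with add-one (Laplace) smoothing, assuming nominal values are stored as integer indices; the constructor and method names are illustrative, not the assignment's declarations:

```java
class CategoricalEstimator {
    private final int[] counts;  // one count per value of the attribute
    private int total;           // number of observations added

    CategoricalEstimator(int numValues) {
        counts = new int[numValues];
    }

    // Add one observed value to the estimator.
    void add(int value) {
        counts[value]++;
        total++;
    }

    // Add-one smoothed probability: (n_v + 1) / (N + V), where V is the
    // number of possible values. Unseen values get a small nonzero
    // probability instead of zeroing out the product in naive Bayes.
    double getProbability(int value) {
        return (counts[value] + 1.0) / (total + counts.length);
    }
}
```

Naive Bayes would keep one such estimator per (attribute, class) pair, plus one for the class prior.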

  4. Implement the Evaluator class that evaluates the performance of any Classifier using training and testing sets or using k-fold cross-validation. Use the option -x to let users specify the number of folds, with 10 as the default. Use the option -s to let users specify a seed for the random-number generator.

    It is important to let users specify a seed for the random-number generator we use to randomly select the examples for training and testing, because we often want to reproduce an experiment exactly. If we initialize the random-number generator with the same seed, the implementation selects the same examples for training and testing, which lets us reproduce the experiment.

    To achieve this behavior, the Evaluator should process its options. If -s is present in the options, then Evaluator.setOptions should use the specified seed to seed the random-number generator. If no such option is present, then Evaluator should seed the random-number generator using the default random seed. Then Evaluator must pass the random-number generator to the objects that need it to randomly select examples, such as DataSet.
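The seeded-split behavior described above can be sketched as follows; the class and method names are assumptions, but the key point, constructing java.util.Random from the user-supplied seed so identical seeds yield identical fold assignments, carries over directly:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Random;

class FoldSplitter {
    // Assign each of n examples to one of k folds. Shuffling with a
    // seeded Random makes the split reproducible: the same seed always
    // produces the same fold assignment.
    static int[] folds(int n, int k, long seed) {
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        Collections.shuffle(Arrays.asList(idx), new Random(seed));
        int[] fold = new int[n];
        for (int i = 0; i < n; i++) fold[idx[i]] = i % k;
        return fold;
    }
}
```

With no -s option, the Evaluator would instead construct Random with its no-argument constructor (the default seeding).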

  5. To help you get started, I put some class and method declarations in

    Implement two separate executables: NaiveBayes and IBk. No windows. No menus. No prompts. Just do it.

    The logic of each executable should be as follows. The user must provide a training set using the -t switch; a testing set is optional. If the user runs a learner and provides only a training set, then the program should evaluate the method on the training set using cross-validation and output the results. Naturally, the user can use the -x switch to change the default number of folds. The output should consist of the average accuracy and some measure of dispersion, such as variance, standard error, or a 95% confidence interval.
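For the dispersion measure, a simple option is the normal-approximation 95% confidence interval over the per-fold accuracies. A sketch, assuming accuracies arrive as a plain array:

```java
class CVSummary {
    // Returns {mean, halfWidth}: the mean accuracy over the folds and the
    // half-width of a 95% confidence interval, using 1.96 * standard error
    // (normal approximation; sample variance with n - 1 in the denominator).
    static double[] summarize(double[] acc) {
        double mean = 0.0;
        for (double a : acc) mean += a;
        mean /= acc.length;

        double var = 0.0;
        for (double a : acc) var += (a - mean) * (a - mean);
        var /= (acc.length - 1);

        double se = Math.sqrt(var / acc.length);
        return new double[]{mean, 1.96 * se};
    }
}
```

The executable would then report, say, the mean plus-or-minus the half-width; reporting the variance or standard error directly is equally acceptable per the spec.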

    Otherwise, if the user provides training and testing sets using the -t and -T switches, respectively, then the program should build a model from the training set, evaluate it on the testing set, and output the results. The output should consist of the accuracy.

Instructions for Submission

In a file named HONOR, please include the statement:
In accordance with the class policies and Georgetown's Honor System,
I certify that, with the exceptions of the class resources and those
items noted below, I have neither given nor received any assistance
on this project.

Include this file in your zip file.

Submit p2 exactly like you submitted p1. Make sure you remove all debugging output before submitting.

Plan B

If Autolab is down, upload your zip file to Canvas.

Copyright © 2019 Mark Maloof. All Rights Reserved. This material may not be published, broadcast, rewritten, or redistributed.