COSC-575: Machine Learning

Project 2
Spring 2017

Due: F 2/17 @ 5:00 P.M.
10 points

Building upon your implementation for p1, implement k-NN and naive Bayes for nominal and numeric attributes. Also implement routines for evaluating classifiers using k-fold cross-validation.

The implementations should be general in the sense that they should work for all data sets with nominal and numeric attributes and nominal class labels. Our convention is that the last attribute of the attribute declarations is the class label.

Tasks:

  1. Design and implement a class hierarchy for classifiers and the learning methods for this project. We'll discuss your thoughts about possible designs in lecture. Based on the discussion thus far, good designs must implement a common interface for training a classifier and for classifying new examples.
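    To make this concrete, the following is a minimal sketch of one possible design; the method names train and classify are illustrative, and DataSet and Example stand in for whatever example representation you carried over from p1. Your names and decomposition may differ.

      // One possible base class; names are illustrative, not required.
      public abstract class Classifier {
        // Builds a model from a set of training examples.
        public abstract void train( DataSet ds ) throws Exception;
        // Returns the index of the predicted class label for an example.
        public abstract int classify( Example example ) throws Exception;
      }

      // IBk and NaiveBayes would then extend Classifier and implement
      // both methods.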

  2. Implement k-NN as the class IBk. As discussed in class, k-NN should scale numeric attributes of training examples to the range [0, 1]. Your implementation does not have to use a fixed-capacity min-heap, but the data structure you use to determine the k-closest neighbors must use O(k) space. The implementation of k-NN should process the command-line switch -k to let users specify the value of k. Use 3 as the default.
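    One O(k)-space approach, assuming numeric attributes have already been scaled by (x - min) / (max - min) using the training set's minima and maxima, is a max-heap keyed on distance that never holds more than k entries. A sketch, where distance is a stand-in for your own distance computation:

      import java.util.PriorityQueue;

      // Returns the indices of the k training examples closest to the
      // query. Heap entries are { distance, index }, ordered farthest
      // first; the heap never exceeds k entries, so space is O(k).
      static int[] kClosest( double[][] train, double[] query, int k ) {
        PriorityQueue<double[]> heap =
            new PriorityQueue<>( k, ( a, b ) -> Double.compare( b[0], a[0] ) );
        for ( int i = 0; i < train.length; i++ ) {
          double d = distance( query, train[i] ); // e.g., Euclidean, on scaled values
          if ( heap.size() < k )
            heap.add( new double[] { d, i } );
          else if ( d < heap.peek()[0] ) {
            heap.poll();                          // evict the farthest of the k
            heap.add( new double[] { d, i } );
          }
        }
        int[] indices = new int[ heap.size() ];
        for ( int j = 0; j < indices.length; j++ )
          indices[j] = (int) heap.poll()[1];
        return indices;
      }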

  3. Implement naive Bayes as the class NaiveBayes. It should use add-one smoothing for nominal attributes. Implement Estimator and derive CategoricalEstimator and GaussianEstimator. The derived classes must implement methods that add values to the estimators and get probabilities for values.
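    A minimal sketch of one way to structure the estimators; the method names add and getProbability are illustrative, and the declarations in p2.zip take precedence:

      public abstract class Estimator {
        // Incorporates one observed value into the estimator's statistics.
        public abstract void add( Number x );
        // Returns the estimated probability (or density) of a value.
        public abstract Double getProbability( Number x );
      }

      public class CategoricalEstimator extends Estimator {
        private int[] counts;  // one count per nominal value
        private int n;         // total number of observations
        public CategoricalEstimator( int numValues ) { counts = new int[numValues]; }
        public void add( Number x ) { counts[ x.intValue() ]++; n++; }
        // Add-one smoothing: each value receives a pseudo-count of one,
        // so unseen values still get nonzero probability.
        public Double getProbability( Number x ) {
          return ( counts[ x.intValue() ] + 1.0 ) / ( n + counts.length );
        }
      }

      public class GaussianEstimator extends Estimator {
        private double sum, sumSq;  // running statistics
        private int n;
        public void add( Number x ) {
          double v = x.doubleValue();
          sum += v; sumSq += v * v; n++;
        }
        // Returns the normal density at x using the sample mean and
        // variance (assumes at least two observations).
        public Double getProbability( Number x ) {
          double mean = sum / n;
          double var  = ( sumSq - n * mean * mean ) / ( n - 1 );
          double d    = x.doubleValue() - mean;
          return Math.exp( -d * d / ( 2.0 * var ) ) / Math.sqrt( 2.0 * Math.PI * var );
        }
      }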

  4. Implement an Evaluator class that evaluates the performance of any Classifier using training and testing sets or using k-fold cross-validation. Use the option -x to let users specify the number of folds, with 10 as the default.
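    The core of k-fold cross-validation partitions the data into k folds and lets each fold serve once as the testing set. A sketch, where trainCV, testCV, and evaluate are hypothetical helpers returning all folds but fold i, fold i alone, and the fraction of correct classifications, respectively:

      // Illustrative k-fold loop; returns the per-fold accuracies.
      public double[] crossValidate( Classifier c, DataSet ds, int folds )
          throws Exception {
        double[] accuracy = new double[folds];
        for ( int i = 0; i < folds; i++ ) {
          c.train( ds.trainCV( folds, i ) );            // train on k - 1 folds
          accuracy[i] = evaluate( c, ds.testCV( folds, i ) );  // test on fold i
        }
        return accuracy;
      }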

  5. To help you get started, I put some class and method declarations in p2.zip.

    Implement two separate executables: NaiveBayes and IBk. No windows. No menus. No prompts. Just do it.

    The logic of each executable should be as follows. The user must provide a training set using the -t switch; a testing set is optional. If the user runs a learner and provides only a training set, then the program should evaluate the method on the training set using cross-validation and output the results. Naturally, the user can use the -x switch to change the default number of folds. The output should consist of the average accuracy and some measure of dispersion, such as variance, standard error, or a 95% confidence interval.
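    For the dispersion measure, one common choice is a 95% confidence interval of mean +/- 1.96 * s / sqrt(k), where s is the sample standard deviation of the per-fold accuracies. A sketch, assuming the per-fold accuracies are in an array:

      // Mean and 95% confidence interval over per-fold accuracies; the
      // 1.96 critical value assumes a normal approximation (for small
      // numbers of folds, a t critical value is more exact).
      double mean = 0.0;
      for ( double a : accuracy ) mean += a;
      mean /= accuracy.length;
      double ss = 0.0;
      for ( double a : accuracy ) ss += ( a - mean ) * ( a - mean );
      double sd = Math.sqrt( ss / ( accuracy.length - 1 ) );
      double half = 1.96 * sd / Math.sqrt( accuracy.length );
      System.out.printf( "Accuracy: %.4f +/- %.4f%n", mean, half );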

    Otherwise, if the user provides training and testing sets using the -t and -T switches, respectively, then the program should build a model from the training set, evaluate it on the testing set, and output the results. The output should consist of the accuracy.
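    For example, assuming placeholder data-file names, invocations might look like:

      $ java NaiveBayes -t train.mff
      $ java NaiveBayes -x 5 -t train.mff
      $ java IBk -k 5 -t train.mff -T test.mff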

Instructions for Submission

In the header comments in at least the main file of your project, provide the following information:
//
// Name
// E-mail Address
// Platform: Windows, MacOS, Linux, Solaris, etc.
//
// In accordance with the class policies and Georgetown's Honor Code,
// I certify that, with the exceptions of the class resources and those
// items noted below, I have neither given nor received any assistance
// on this project.
//

Submit via Autolab. When you are ready to submit your program for grading, create a zip file named submit.zip containing only your project's source files. You have two chances to compile and run against the autograding routines.
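For instance, assuming Java source files in the current directory, one way to create the archive is:

    $ zip submit.zip *.java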

Plan B

If Autolab is down, upload your zip file to Blackboard.

Copyright © 2019 Mark Maloof. All Rights Reserved. This material may not be published, broadcast, rewritten, or redistributed.