COSC 575: Machine Learning

Project 2
Spring 2015

Due: Sun, Feb 15 @ 11:59 P.M.
10 points

Building upon your implementation for p1, implement k-NN and naive Bayes for nominal and numeric attributes. Also implement routines for k-fold cross-validation.

The implementations should be general in the sense that they should work for all data sets with nominal and numeric attributes and nominal class labels. Our convention is that the last attribute of the attribute declarations is the class label.

Tasks:

  1. Design and implement a class hierarchy for classifiers and the learning methods for this project. We'll discuss your thoughts about possible designs in lecture. Based on the discussion thus far, good designs must implement a common Classifier abstraction that all learners share, with operations for training a model and for classifying examples; a sketch follows.
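
    As a starting point, here is a minimal sketch of one possible hierarchy. The method names and signatures are illustrative, not requirements.

      import java.util.List;

      // A minimal sketch, assuming train/classify as the shared operations.
      // Examples are represented here as double[] with the class label last,
      // per the convention above; adapt to your own data-set representation.
      abstract class Classifier {
          abstract void train(List<double[]> examples);  // build the model
          abstract int classify(double[] example);       // predicted class index
      }

    IBk and NaiveBayes would then extend Classifier, which lets the Evaluator of task 4 work with either learner through the abstract type.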

  2. Implement k-NN (or IBk). As discussed in class, k-NN should scale the numeric attributes of training examples to the range [0, 1]. Your implementation does not have to use a fixed-capacity heap, but the data structure you use to determine the k closest neighbors must use O(k) space; one such structure is sketched below. The implementation of k-NN should include the command-line switch -k to specify the value of k. Use 3 as the default.
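
    For the O(k)-space requirement, one common approach is a size-bounded max-heap keyed on distance: once the heap holds k neighbors, a new candidate replaces the root only if it is closer. (For the scaling requirement, recall that min-max scaling maps a value x to (x - min) / (max - min), with min and max taken from the training set.) The sketch below uses java.util.PriorityQueue; the Neighbor and NearestNeighbors names are hypothetical.

      import java.util.Comparator;
      import java.util.PriorityQueue;

      // Hypothetical candidate-neighbor record.
      class Neighbor {
          final int label;        // class label of the training example
          final double distance;  // distance from the query example
          Neighbor(int label, double distance) {
              this.label = label;
              this.distance = distance;
          }
      }

      // Keeps only the k closest neighbors seen so far: O(k) space.
      class NearestNeighbors {
          private final int k;
          // Max-heap by distance: the root is the farthest of the k kept.
          private final PriorityQueue<Neighbor> heap;

          NearestNeighbors(int k) {
              this.k = k;
              this.heap = new PriorityQueue<>(k,
                  Comparator.comparingDouble((Neighbor n) -> n.distance).reversed());
          }

          // Offer a candidate; evict the current farthest if this one is closer.
          void offer(Neighbor n) {
              if (heap.size() < k) {
                  heap.add(n);
              } else if (n.distance < heap.peek().distance) {
                  heap.poll();
                  heap.add(n);
              }
          }

          Iterable<Neighbor> neighbors() { return heap; }
      }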

  3. Implement naive Bayes. It should use add-one smoothing for nominal attributes. Implement an Estimator base class and derive CategoricalEstimator and NormalEstimator from it. The derived classes must implement methods that add values to the estimators and that get probabilities for values; one possible shape for this hierarchy is sketched below.
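
    As one illustration of what the hierarchy might look like: CategoricalEstimator keeps smoothed counts, and NormalEstimator fits a normal density, p(x) = exp(-(x - mean)^2 / (2 var)) / sqrt(2 pi var), from running sums. The method names add and getProbability are assumptions, not requirements.

      // Illustrative hierarchy; your design from task 1 may shape this differently.
      abstract class Estimator {
          protected int n = 0;                      // number of values seen
          abstract void add(double value);          // incorporate one training value
          abstract double getProbability(double value);
      }

      // Nominal attributes: counts with add-one (Laplace) smoothing.
      class CategoricalEstimator extends Estimator {
          private final int[] counts;               // one counter per category
          CategoricalEstimator(int numCategories) {
              counts = new int[numCategories];
          }
          void add(double value) {
              counts[(int) value]++;
              n++;
          }
          double getProbability(double value) {
              // add-one smoothing: (count + 1) / (n + number of categories)
              return (counts[(int) value] + 1.0) / (n + counts.length);
          }
      }

      // Numeric attributes: a normal density fit from running sums.
      class NormalEstimator extends Estimator {
          private double sum = 0.0, sumSq = 0.0;
          void add(double value) {
              sum += value;
              sumSq += value * value;
              n++;
          }
          double getProbability(double value) {
              // A real implementation should guard against n < 2 and zero variance.
              double mean = sum / n;
              double var = (sumSq - n * mean * mean) / (n - 1);  // sample variance
              double diff = value - mean;
              return Math.exp(-diff * diff / (2 * var)) / Math.sqrt(2 * Math.PI * var);
          }
      }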

  4. Implement an Evaluator class that evaluates the performance of any Classifier using either k-fold cross-validation or the hold-out method. Use the option -x to specify the number of folds, with 10 being the default.

    Implement the learners as two separate executables: NaiveBayes and IBk. No windows. No menus. No prompts. Just do it.

    The logic of each implementation should be as follows. If the user runs a learner and specifies only a training set (using the -t switch), the program should evaluate the method on the training set using 10-fold cross-validation and output the results. Naturally, the user can use the -x switch to change the default number of folds. If the user instead specifies both a training and a testing set (using the -t and -T switches, respectively), the program should use the hold-out method to build a model from the training set, evaluate it on the testing set, and output the results. The output should consist only of the accuracy or, in the case of k-fold cross-validation, the average accuracy and some measure of dispersion, such as the variance, the standard error, or a 95% confidence interval. Both paths are sketched below.
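
    To make the cross-validation path concrete, here is a rough sketch that reuses the hypothetical Classifier from task 1; assigning example j to fold j mod folds after shuffling is just one reasonable partitioning scheme.

      import java.util.ArrayList;
      import java.util.Collections;
      import java.util.List;

      // Illustrative Evaluator; assumes the Classifier sketch from task 1.
      class Evaluator {
          // Returns per-fold accuracies; the caller reports their mean and
          // a measure of dispersion.
          static double[] crossValidate(Classifier c, List<double[]> data, int folds) {
              Collections.shuffle(data);            // randomize before partitioning
              double[] acc = new double[folds];
              for (int i = 0; i < folds; i++) {
                  List<double[]> train = new ArrayList<>();
                  List<double[]> test = new ArrayList<>();
                  for (int j = 0; j < data.size(); j++) {
                      (j % folds == i ? test : train).add(data.get(j));
                  }
                  c.train(train);
                  acc[i] = accuracy(c, test);
              }
              return acc;
          }

          // Hold-out path: fraction of test examples classified correctly,
          // using the convention that the last attribute is the class label.
          static double accuracy(Classifier c, List<double[]> test) {
              int correct = 0;
              for (double[] ex : test) {
                  if (c.classify(ex) == (int) ex[ex.length - 1]) correct++;
              }
              return (double) correct / test.size();
          }
      }

    From the per-fold accuracies one can report, for example, the mean plus or minus t(0.025, folds - 1) * s / sqrt(folds) as a 95% confidence interval, where s is the sample standard deviation across folds.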
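
    The switch handling itself might follow this pattern; the class name IBkDriver and the println placeholders stand in for your actual program:

      // Hypothetical driver showing the -t/-T decision; error handling omitted.
      public class IBkDriver {
          public static void main(String[] args) {
              String trainFile = null, testFile = null;
              int folds = 10, k = 3;                // defaults from the spec
              for (int i = 0; i < args.length; i++) {
                  switch (args[i]) {
                      case "-t": trainFile = args[++i]; break;
                      case "-T": testFile = args[++i]; break;
                      case "-x": folds = Integer.parseInt(args[++i]); break;
                      case "-k": k = Integer.parseInt(args[++i]); break;
                  }
              }
              if (testFile == null) {
                  // only -t given: cross-validate on the training set
                  System.out.println("cross-validating " + trainFile
                      + " with " + folds + " folds, k = " + k);
              } else {
                  // -t and -T given: hold-out evaluation
                  System.out.println("training on " + trainFile
                      + ", testing on " + testFile);
              }
          }
      }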

  5. It would be wise to test your implementations using all of the data sets on Blackboard, but for your submission, select a few of them and evaluate your implementations using 10-fold cross-validation. Run naive Bayes and k-NN for k = 1, 3, 5, and 7. Place the output of these runs in a text file and include it with your submission.

Instructions for Submission

In the header comments of at least the main file of your project, provide the following information:
//
// Name
// E-mail Address
// Platform: Windows, MacOS, Linux, Solaris, etc.
// Language/Environment: gcc, g++, java, g77, ruby, python, etc.
//
// In accordance with the class policies and Georgetown's Honor Code,
// I certify that, with the exceptions of the class resources and those
// items noted below, I have neither given nor received any assistance
// on this project.
//

Submit via Blackboard. When you are ready to submit your program for grading, create a zip file of a single-level directory containing only your project's source, and upload it to Blackboard. The directory's name should be the same as your net ID. The zip file should be named p2.zip. If you need to include a note with your submission, put the note in a README file in the directory. Make sure I have clear instructions on how to build and run your executables. If you're using C or C++, provide a Makefile. If you're using Java, do not use packages. Make sure compiling your project produces two executables named NaiveBayes and IBk with the appropriate extension.