COSC 575: Machine Learning

Project 2
Spring 2012

Due: Thu, Feb 23 @ 8 P.M.
10 points

Building upon your implementation for p1, implement k-NN and naive Bayes for symbolic, discrete, and numeric attributes. Also implement routines for k-fold cross-validation.

The implementations should be general in the sense that they should work for all data sets with symbolic, discrete, and numeric attributes and symbolic class labels. Our convention is that the last attribute of the attribute declarations is the class label.

Tasks:

  1. Design and implement a class hierarchy for classifiers and the learning methods for this project.

  2. Implement k-NN (or IBk). As discussed in class, k-NN should scale numeric attributes of training examples to the range [0, 1]. Your implementation does not have to use a fixed-capacity min-heap, but the data structure you use to determine the k closest neighbors must use O(k) space. The implementation of k-NN should include the command-line switch -k to specify the value of k. Use 3 as the default.

  3. Implement naive Bayes. The implementation should handle numeric attributes using either parametric or nonparametric density estimation. It should handle symbolic and discrete attributes using either add-one smoothing or m-estimates.

  4. Implement k-fold cross-validation and integrate it into both implementations. Make sure that the methods for cross-validation belong to the correct class. Use the switch -x to specify the number of folds, with 10 being the default.

    Implement the learners as two separate executables. No windows. No menus. No prompts. Just do it.

    The logic of each implementation should be as follows. If the user runs a learner and specifies only a training set (using the -t switch), then the program should evaluate the method on the training set using 10-fold cross-validation and output the results. Naturally, the user can use the -x switch to change the default. Otherwise, if the user specifies both a training and testing set (using the -t and -T switches, respectively), then the program should build a model from the training set, evaluate it on the testing set, and output the results. The output can consist of the accuracy or, in the case of k-fold cross-validation, average accuracy and some measure of dispersion, such as variance, standard error, or a 95% confidence interval.

  5. It would be wise to test your implementations using all of the data sets in Blackboard, but for your submission, select a few of them, and evaluate your implementations using 10-fold cross-validation. Run naive Bayes, and run k-NN for k = 1, 3, 5, and 7. Place the output of these runs in a text file and include it with your submission.
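
As an illustration of the O(k) space requirement in task 2, here is a minimal sketch in Python of one way to keep only the k closest neighbors while scanning the training examples. It is not a required design. The function name k_nearest and the (features, label) representation are assumptions made for this example, and the sketch covers numeric attributes only, assumed to be already scaled to [0, 1]; symbolic and discrete attributes would need their own distance contribution.

```python
import heapq
import math


def k_nearest(query, examples, k=3):
    """Return the k (distance, label) pairs closest to query, nearest first.

    examples is any iterable of (features, label) pairs (a hypothetical
    representation) whose numeric features are already scaled to [0, 1].
    """
    # Bounded heap of (-distance, index, label). Negating the distance turns
    # Python's min-heap into a max-heap, so heap[0] is the farthest neighbor
    # kept so far. The heap never grows beyond k entries.
    heap = []
    for index, (features, label) in enumerate(examples):
        distance = math.dist(query, features)  # Euclidean distance
        if len(heap) < k:
            heapq.heappush(heap, (-distance, index, label))
        elif -distance > heap[0][0]:
            # Strictly closer than the farthest kept neighbor: swap it out.
            heapq.heapreplace(heap, (-distance, index, label))
    return sorted((-negated, label) for negated, _, label in heap)
```

Because examples is consumed one at a time and at most k entries are held, the extra space is O(k) regardless of the size of the training set. Ties in distance are resolved in favor of the example seen first, which is only one of several reasonable choices.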
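
For the symbolic and discrete attributes in task 3, the arithmetic of add-one smoothing can be sketched as follows. The function name smoothed_likelihoods and its arguments are hypothetical; the sketch assumes you have already collected the values of one attribute among the training examples of one class, and that the attribute's declaration gives its full domain.

```python
from collections import Counter


def smoothed_likelihoods(values, domain):
    """Estimate P(value | class) with add-one (Laplace) smoothing.

    values: attribute values observed among the training examples of one class.
    domain: every value the attribute can take, per its declaration.
    """
    counts = Counter(values)
    # Each domain value receives one extra count, so the denominator grows
    # by the size of the domain and the estimates still sum to 1.
    total = len(values) + len(domain)
    return {value: (counts[value] + 1) / total for value in domain}
```

With add-one smoothing, a value that never occurs with a class still receives a small nonzero probability, so a single unseen value does not drive the product of likelihoods to zero.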
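
The bookkeeping behind task 4 (partition the examples into folds, hold each fold out once, then report average accuracy and a measure of dispersion) can be sketched as below. This shows only the arithmetic, written as free functions; it does not answer the question of which class the methods belong to. The names fold_indices, cross_validate, and train_and_score are hypothetical, the folds are not stratified by class label, and the sketch assumes at least two folds and at least as many examples as folds.

```python
import math
import random
import statistics


def fold_indices(n_examples, n_folds=10, seed=0):
    """Shuffle the indices 0..n_examples-1 and deal them into n_folds folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Dealing every n_folds-th index keeps fold sizes within one of each other.
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validate(examples, train_and_score, n_folds=10, seed=0):
    """Return (mean accuracy, standard error) over n_folds folds.

    train_and_score(train, test) is a hypothetical callback that builds a
    model from train and returns its accuracy on test.
    """
    accuracies = []
    for held_out in fold_indices(len(examples), n_folds, seed):
        held = set(held_out)
        train = [ex for i, ex in enumerate(examples) if i not in held]
        test = [examples[i] for i in held_out]
        accuracies.append(train_and_score(train, test))
    mean = statistics.mean(accuracies)
    # Standard error of the mean: sample standard deviation / sqrt(folds).
    std_error = statistics.stdev(accuracies) / math.sqrt(n_folds)
    return mean, std_error
```

Every example is tested exactly once and trained on in the other folds. Fixing the seed makes runs repeatable, which helps when comparing naive Bayes with k-NN on the same partition.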

Instructions for Submission

In the header comments in at least the main file of your project, provide the following information:
//
// Name
// E-mail Address
// Platform: Windows, MacOS, Linux, Solaris, etc.
// Language/Environment: gcc, g++, java, g77, ruby, python, haskell, etc.
//
// In accordance with the class policies and Georgetown's Honor Code,
// I certify that, with the exceptions of the class resources and those
// items noted below, I have neither given nor received any assistance
// on this project.
//
Make sure I have clear instructions on how to build and run your executables. If you're using C or C++, then provide a Makefile.

Submit via Blackboard. When you are ready to submit your program for grading, create a compressed archive of a directory containing only your project's source, and upload it to Blackboard. The directory's name should be the same as your net ID. If you need to include a note with your submission, put the note in a README file in the directory. Submit your project before 8:00 P.M. on the due date.