COSC 288: Introduction to Machine Learning

Project 2
Spring 2009

Due: Fri, Feb 20 @ 10 P.M.
7 points

Implement k-NN and naive Bayes for symbolic attributes. Also implement routines for k-fold cross-validation. Use the so-called votes and mushroom data sets to evaluate the algorithms and your implementations.

Tasks:

  1. Building upon your implementation for p1, implement k-NN and naive Bayes. The implementations should be general in the sense that they should work for all data sets with symbolic attributes and symbolic class labels. Our convention is that the last attribute of the attribute declarations is the class label. For k-NN, use the switch -k to specify the value of k. Use 3 as the default.

  2. Implement k-fold cross-validation and integrate it into both implementations. Use the switch -x to specify the number of folds, with 10 being the default.

    Implement the learners as two separate executables. No windows. No menus. No prompts. Just do it.

    The logic of each implementation should be as follows. If the user runs a learner and specifies only a training set, then the program should evaluate using 10-fold cross-validation and output the results. Naturally, the user can use the -x switch to change the number of folds. Otherwise, if the user specifies both a training set and a testing set, then the program should build a model from the training set, evaluate it on the testing set, and output the results.

    Your object-oriented design should be something that only a software engineer would love, appreciate, and cherish.

  3. Evaluate the algorithms and your implementations using 10-fold cross-validation. Use the mushroom data set: mushroom.mff, and the votes data set: votes.mff. Run naive Bayes (obviously) and k-NN, for k = 1, 3, 5, and 7. Place the output of these runs in a text file and include it with your submission. On a Unix machine, you can record program output using the script command, or you can direct the output to a file.
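
As a rough illustration of how the pieces in Tasks 1-3 fit together, here is a minimal Python sketch. It is not a reference solution, and it rests on several assumptions that the handout does not state: examples are already parsed into tuples whose last element is the class label, k-NN measures distance by counting mismatched attribute values (Hamming distance), naive Bayes uses add-one smoothing over the values seen in training, and folds are formed by a seeded shuffle without stratification. All function names and the toy data are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def hamming(a, b):
    """Number of attribute positions at which two examples differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_learner(k=3):
    """Return a learn(train) function for k-NN with majority voting."""
    def learn(train):
        def classify(query):
            # sorted() is stable, so distance ties keep training order.
            nearest = sorted(train, key=lambda ex: hamming(ex[:-1], query))[:k]
            return Counter(ex[-1] for ex in nearest).most_common(1)[0][0]
        return classify
    return learn


def nb_learn(train):
    """Train naive Bayes; add-one smoothing is an assumption of this sketch."""
    class_counts = Counter(ex[-1] for ex in train)
    value_counts = defaultdict(Counter)  # (attribute index, label) -> value counts
    domains = defaultdict(set)           # attribute index -> values seen in training
    for ex in train:
        for i, value in enumerate(ex[:-1]):
            value_counts[(i, ex[-1])][value] += 1
            domains[i].add(value)

    def classify(query):
        def log_score(label):
            # Unnormalized log posterior: log prior plus log likelihoods.
            n = class_counts[label]
            score = math.log(n / len(train))
            for i, value in enumerate(query):
                seen = value_counts[(i, label)][value]
                score += math.log((seen + 1) / (n + len(domains[i])))
            return score
        return max(class_counts, key=log_score)
    return classify


def cross_validate(examples, learn, folds=10, seed=0):
    """Fraction of examples classified correctly, each tested exactly once."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    correct = 0
    for f in range(folds):
        test = shuffled[f::folds]  # every folds-th example, starting at index f
        train = [ex for i, ex in enumerate(shuffled) if i % folds != f]
        classify = learn(train)
        correct += sum(classify(ex[:-1]) == ex[-1] for ex in test)
    return correct / len(shuffled)


# Invented toy data: two symbolic attributes followed by the class label.
toy = [("y", "n", "dem"), ("y", "y", "dem"),
       ("n", "y", "rep"), ("n", "n", "rep")] * 5

if __name__ == "__main__":
    for k in (1, 3, 5, 7):
        print("k-NN, k =", k, "accuracy:", cross_validate(toy, knn_learner(k)))
    print("naive Bayes accuracy:", cross_validate(toy, nb_learn))
```

Passing a learner function to cross_validate keeps the evaluation code identical for both algorithms, which is one way to satisfy the requirement that cross-validation be integrated into both implementations. A real submission would still need to read the .mff files, take attribute domains from the attribute declarations rather than from the training data, and handle the -k and -x switches.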

Instructions for Submission

In the header comments in at least the main file of your project, provide the following information:
//
// Name
// E-mail Address
// Platform: Windows, MacOS, Linux, Solaris, etc.
// Language/Environment: gcc, g++, java, g77, ruby, python, haskell, etc.
//
// In accordance with the class policies and Georgetown's Honor Code,
// I certify that, with the exceptions of the class resources and those
// items noted below, I have neither given nor received any assistance
// on this project.
//
Make sure I have clear instructions on how to run your executables. If you're using C or C++, then provide a Makefile.

Submit via Blackboard. When you are ready to submit your program for grading, create a compressed archive of a directory containing only your project's source, and upload it to Blackboard. The directory's name should be the same as your net ID. If you need to include a note with your submission, put the note in a README file in the directory.

For example, assume your net ID is ab123. If the directory p2 contains your project, then rename the directory to ab123.

To make the archive smaller, remove any compiled files, such as .class, .o, and a.out files.

Use zip, tar, or jar to create an archive:

% zip -r ab123.zip ab123/*
% tar -cf ab123.tar ab123
% jar -cf ab123.jar ab123
Use jar only for Java projects. If you use jar or tar, then compress the archive by typing
% gzip ab123.tar
% gzip ab123.jar
which create the files ab123.tar.gz and ab123.jar.gz, respectively.

Upload the compressed archive to Blackboard.

Submit your project before 10:00 P.M. on the due date.