COSC 688: Machine Learning

Project 1
Fall 2007

Due: Oct 4 @ 5 P.M.
15 points

Implement incremental versions of k-NN and naive Bayes, and conduct a proper evaluation of these two methods using the Pima Indians Diabetes data set and the 1984 Congressional Voting Record data set. Information about these data sets appears in the files pima.names and votes.names, respectively. (Note that in the votes data set, I changed '?' to 'u'.) Since you know how to read and parse files, I have taken the liberty of creating static arrays containing the data sets: pima.h and votes.h. Feel free to adjust the representation of the data sets to suit your needs, but you may not modify the data itself.
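To illustrate what "incremental" means here, the following is a minimal sketch in which each learner processes one example at a time through an `update` method. The class and method names are hypothetical, not a required interface. The sketch assumes that symbolic attributes are strings and numeric attributes are numbers, that naive Bayes models numeric attributes with a per-class Gaussian and smooths symbolic counts with add-one smoothing, and that k-NN uses unnormalized Euclidean distance with a 0/1 mismatch penalty for symbols. Other choices are equally defensible.

```python
import math
from collections import Counter, defaultdict


class IncrementalNaiveBayes:
    """Naive Bayes that updates its sufficient statistics one example at a time."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        # (label, attribute index) -> {symbol: count} for symbolic attributes
        self.symbol_counts = defaultdict(lambda: defaultdict(int))
        # attribute index -> set of symbols seen so far (used for smoothing)
        self.symbols_seen = defaultdict(set)
        # (label, attribute index) -> [n, mean, M2], maintained with Welford's method
        self.numeric_stats = defaultdict(lambda: [0, 0.0, 0.0])

    def update(self, attributes, label):
        self.class_counts[label] += 1
        for i, value in enumerate(attributes):
            if isinstance(value, str):  # assumption: strings are symbolic
                self.symbol_counts[(label, i)][value] += 1
                self.symbols_seen[i].add(value)
            else:
                stats = self.numeric_stats[(label, i)]
                stats[0] += 1
                delta = value - stats[1]
                stats[1] += delta / stats[0]
                stats[2] += delta * (value - stats[1])

    def predict(self, attributes):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            for i, value in enumerate(attributes):
                if isinstance(value, str):
                    seen = self.symbol_counts[(label, i)].get(value, 0)
                    n_symbols = max(len(self.symbols_seen[i]), 1)
                    score += math.log((seen + 1) / (count + n_symbols))
                else:
                    n, mean, m2 = self.numeric_stats[(label, i)]
                    # Floor the variance so a constant attribute does not divide by zero.
                    var = max(m2 / n, 1e-9) if n > 0 else 1.0
                    score += -0.5 * math.log(2 * math.pi * var)
                    score -= (value - mean) ** 2 / (2 * var)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class IncrementalKNN:
    """k-NN is incremental by nature: learning just stores the example."""

    def __init__(self, k):
        self.k = k
        self.examples = []

    def update(self, attributes, label):
        self.examples.append((attributes, label))

    @staticmethod
    def distance(a, b):
        total = 0.0
        for x, y in zip(a, b):
            if isinstance(x, str) or isinstance(y, str):
                total += 0.0 if x == y else 1.0  # symbolic mismatch penalty
            else:
                total += (x - y) ** 2
        return math.sqrt(total)

    def predict(self, attributes):
        nearest = sorted(self.examples,
                         key=lambda e: self.distance(e[0], attributes))[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]
```

Because the numeric distances are not normalized, attributes with large ranges dominate the k-NN distance in this sketch. Whether and how to rescale them is a design decision left open here.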

To evaluate the methods, implement and use ten-fold cross-validation. Compare naive Bayes to k-NN, for k = 1, 3, 5, and 7. That is, for each data set, compare naive Bayes to k-NN for k = 1, then compare naive Bayes to k-NN for k = 3, and so on. As performance metrics, compute accuracy, true-positive rate, and false-positive rate. In addition to calculating the average of these metrics over the ten folds, also calculate a measure of dispersion. Finally, implement and apply a t-test to determine, for each pairwise comparison, whether there is a statistically significant difference between the performances of the two methods.
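The pieces of this evaluation can be sketched roughly as follows. All function names are hypothetical. The sketch assumes a shuffled, unstratified split into folds, the sample standard deviation as the measure of dispersion, and a paired t-test over the per-fold scores of two methods evaluated on the same folds. These are assumptions made for illustration, not the only acceptable choices.

```python
import math
import random


def ten_fold_indices(n_examples, n_folds=10, seed=0):
    """Shuffle example indices and deal them into n_folds disjoint test folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[f::n_folds] for f in range(n_folds)]


def fold_metrics(actual, predicted, positive):
    """Return (accuracy, true-positive rate, false-positive rate) for one fold."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return accuracy, tpr, fpr


def mean_and_std(values):
    """Mean and sample standard deviation (n - 1 in the denominator)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return mean, math.sqrt(variance)


def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired per-fold scores; degrees of freedom = folds - 1."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean, std = mean_and_std(diffs)
    if std == 0.0:
        # All differences are identical: no evidence if zero, unbounded otherwise.
        return 0.0 if mean == 0.0 else math.copysign(math.inf, mean)
    return mean / (std / math.sqrt(len(diffs)))
```

With ten folds there are nine degrees of freedom, so for a two-tailed test at the 0.05 level the magnitude of the statistic would be compared against the critical value 2.262.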

The main function should run and evaluate the methods on the data sets, as described previously. It should write clean, readable, well-organized output to the screen. In a text file, discuss and analyze the experimental results. Issues you should address include:

• Which method performed best in terms of accuracy, and why?
• Which value of k produced the best results and why?
• What were the average run times of the learning and performance elements?
• How much of the overall accuracy was due to superior performance on one of the classes?
• What questions do the results raise?
• How might you pursue such questions in future projects?
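Two of the questions above, run time and per-class performance, call for measurements beyond the three required metrics. One way to collect them is sketched below. The helper names are hypothetical, and the sketch assumes wall-clock timing with `time.perf_counter` is adequate for the comparison.

```python
import time
from collections import defaultdict


def timed(function, *args):
    """Call function(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = function(*args)
    return result, time.perf_counter() - start


def per_class_accuracy(actual, predicted):
    """Fraction of examples of each actual class that were labeled correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for a, p in zip(actual, predicted):
        total[a] += 1
        correct[a] += int(a == p)
    return {label: correct[label] / total[label] for label in total}
```

Timing the learning element (the calls that add training examples) separately from the performance element (the calls that classify test examples) in each fold, then averaging over folds, gives the two averages the question asks about.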

Since I am assuming that your design and implementation will follow standard practice, I will not award points for such practice. However, if the design or implementation is subpar, then there will be a deduction. You do not have to document your code. Feel free to use any language, but I must be able to compile and execute your code on my machine. If you plan on using something that is proprietary, non-standard, or exotic, you should talk to me first. The implementations must be general, meaning that they must work for any similarly represented data set with numeric and symbolic attributes and symbolic class labels. Finally, if you consult sources outside the class materials, you must cite those sources. As I said in class, I will grade you on the quality of your sources. You may not consult with anyone other than me about the project, and you may not consult or use someone else's source code or implementation. Feel free to contact me with questions, but it would be best to discuss any issues in class so everyone can contribute and benefit.

```
/**
 * Name:
 * Platform: Windows, OS X, Linux (seva), Solaris, bsd, etc.
 * Language/Environment: gcc, g++, java, python, ruby, clisp, g77, g95, etc.
 *
 * In accordance with the class policies and Georgetown's Honor Code,
 * I certify that, with the exceptions of the class resources and those
 * items noted below, I have neither given nor received any assistance
 * on this project.
 */
```
When you are ready to submit your program for grading, create a compressed archive of a directory containing only your project's source, and send it to me by e-mail as an attachment. The directory's name should be the same as your net ID.

For example, assume your net ID is ab123. If the directory p1 contains your project, then rename the directory to ab123.

To make the archive smaller, remove any compiled files, such as .class, a.out, and .o files.

Use zip, tar, or jar to create an archive:

```
% zip -r ab123.zip ab123
% tar -cf ab123.tar ab123
% jar -cf ab123.jar ab123
```
Use jar only for Java projects. If you use jar or tar, then compress the archive by typing
```
% gzip ab123.tar
% gzip ab123.jar
```
which creates the file ab123.tar.gz or ab123.jar.gz, respectively.

N.B. If you use zip, then you need to change the extension of your file to something other than .zip, as UIS strips .zip attachments. The extension .piz works pretty well. So you'd rename ab123.zip to ab123.piz.

Attach the file containing your project to an e-mail and send it to me.

Make sure you send a carbon copy of your project to yourself, so you'll have a record of when you submitted your project. Ideally, also keep a copy on a university or department machine. However, make sure that your archive, directory, or files are not readable by others.

Submit your project before 5:00 P.M. on the due date.