COSC 387: Artificial Intelligence

Project 4
Fall 2002

Due: Nov 19 @ 5 P.M.
13 points

In class, we have discussed batch learning algorithms, which are appropriate when we can gather all of the relevant training data before learning begins. However, in many applications of machine learning we cannot collect all of the relevant data in advance: intrusion detection, e-mail importance ranking, and intelligent user interfaces are three such applications.

For these domains, we must use on-line learning algorithms, which learn from examples distributed over time and must perform whenever necessary. If an on-line algorithm modifies its existing concept descriptions as new examples arrive, then it is incremental. k-nearest neighbors (k-NN) is inherently incremental. When a new example arrives, the learning element simply adds the example to the store. When an unknown instance arrives, the performance element takes each stored example into account by computing its distance from the unknown.

In his Ph.D. dissertation, Jeff Schlimmer studied incremental learning and introduced the Mushroom Data Set, derived from The Audubon Society Field Guide to North American Mushrooms. There are 5644 examples of poisonous and edible mushrooms characterized by 22 symbolic (or nominal) attributes. The goal is to learn to predict edible and poisonous mushrooms. The file mushroom.data.lisp contains the data set in Lisp format. Each list is an example (i.e., a mushroom). The first element is the class label: p = poisonous, e = edible. The rest of the list contains the 22 attribute values. The file mushroom.info describes these attributes and their domains.
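
One convenient way to load the data is to read each example list from the file in turn. The sketch below is only illustrative (the name load-examples is not part of the project) and assumes the file contains nothing but a sequence of Lisp lists:

(defun load-examples (filename)
  "Read every example list in FILENAME and return them in one list."
  (with-open-file (in filename :direction :input)
    (loop for example = (read in nil nil)
          while example
          collect example)))

You might then bind the result to *data* with, for example, (defvar *data* (load-examples "mushroom.data.lisp")).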

In this project, you are to implement incremental versions of k-NN and naive Bayes and conduct an evaluation of these two methods using the Mushroom data set.

Tasks:

  1. Implement an ``incremental'' version of k-NN. For the learning element, implement the function knn-train. For the performance element, implement the function knn-classify. Think about the appropriate interfaces for these functions. Aside from *data*, *train*, and *test*, do not use global variables.
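
    One possible set of interfaces is sketched below; it is only a sketch, assuming that the store is a plain list of examples, that distance is the number of mismatched attribute values, and that an unknown instance is a list of attribute values without a class label. The helper attribute-distance is an illustrative name, and your own design may differ.

    (defun knn-train (example store)
      "Incremental learning element: add EXAMPLE to STORE and return the new store."
      (cons example store))

    (defun attribute-distance (attrs1 attrs2)
      "Number of positions at which two attribute-value lists disagree."
      (count nil (mapcar #'eql attrs1 attrs2)))

    (defun knn-classify (unknown store k)
      "Performance element: return the majority class among the K stored examples nearest to UNKNOWN."
      (let* ((sorted (sort (copy-list store) #'<
                           :key #'(lambda (ex) (attribute-distance unknown (rest ex)))))
             (neighbors (subseq sorted 0 (min k (length sorted))))
             (votes '()))
        (dolist (ex neighbors)
          (let ((entry (assoc (first ex) votes)))
            (if entry
                (incf (cdr entry))
                (push (cons (first ex) 1) votes))))
        (car (first (sort votes #'> :key #'cdr)))))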

  2. Implement an incremental version of naive Bayes. Your implementation should be general, meaning that it should work for any data set with nominal attributes. For the learning element, implement the function nb-train. For the performance element, implement the function nb-classify. Think about the appropriate interfaces for these functions. Aside from *data*, *train*, and *test*, do not use global variables. Association lists may prove useful, as may incf.
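
    As one illustration of how the counts might be organized, the sketch below keeps a class-count association list and a flat association list keyed by (class attribute-position value), both updated with incf; the three-element model representation and the crude add-one smoothing are assumptions of this sketch, not requirements.

    (defun nb-train (example model)
      "Incremental learning element: fold EXAMPLE into MODEL and return the updated model."
      (destructuring-bind (class-counts value-counts n) (or model (list nil nil 0))
        (let ((class (first example)))
          ;; Count the class label.
          (let ((entry (assoc class class-counts)))
            (if entry
                (incf (cdr entry))
                (push (cons class 1) class-counts)))
          ;; Count each (class, attribute position, value) triple.
          (loop for value in (rest example)
                for i from 0
                do (let ((entry (assoc (list class i value) value-counts :test #'equal)))
                     (if entry
                         (incf (cdr entry))
                         (push (cons (list class i value) 1) value-counts))))
          (list class-counts value-counts (1+ n)))))

    (defun nb-classify (unknown model)
      "Performance element: return the most probable class for UNKNOWN, a list of attribute values."
      (destructuring-bind (class-counts value-counts n) model
        (let ((best-class nil)
              (best-score nil))
          (dolist (class-entry class-counts)
            (let* ((class (car class-entry))
                   (class-count (cdr class-entry))
                   ;; Log P(class), then add log P(value | class) for each attribute,
                   ;; with add-one smoothing on each conditional estimate.
                   (score (log (/ class-count n))))
              (loop for value in unknown
                    for i from 0
                    do (let ((entry (assoc (list class i value) value-counts :test #'equal)))
                         (incf score (log (/ (1+ (if entry (cdr entry) 0))
                                             (1+ class-count))))))
              (when (or (null best-score) (> score best-score))
                (setf best-score score
                      best-class class))))
          best-class)))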

  3. Conduct an experiment to determine the best performing method using the data set in the file mushroom.data.lisp.

  4. To perform the experiment, conduct 10 runs. For each run, randomly divide the examples into a training set (60%) and a testing set (40%); one way to make this split is sketched below. Process the examples in the training set with naive Bayes and with k-NN for k = 1 and k = 3. After processing each 10% portion of the training data, record the time spent learning, the time spent testing, the percent correct on the testing set as a whole, the percent correct on the poisonous class, and the percent correct on the edible class.

    You can time the execution of a Lisp function using the time function. For example, if you have a function called classify and the list of testing examples bound to the atom testing, then to time the classify function you simply type: (time (classify testing)).
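
    One way to make the random 60/40 split mentioned above is sketched below; random-split is only an illustrative name, and the sketch shuffles by pairing each example with a random key:

    (defun random-split (examples fraction)
      "Shuffle EXAMPLES; return the first FRACTION of them and the rest as two values."
      (let* ((tagged (mapcar #'(lambda (ex) (cons (random 1.0) ex)) examples))
             (shuffled (mapcar #'cdr (sort tagged #'< :key #'car)))
             (cut (round (* fraction (length shuffled)))))
        (values (subseq shuffled 0 cut)
                (subseq shuffled cut))))

    If *train* and *test* have been declared special (e.g., with defvar), then (multiple-value-setq (*train* *test*) (random-split *data* 0.6)) makes one such split.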

  5. For each performance measure (i.e., percent correct overall, percent correct per class, learning time, and testing time) and each algorithm, compute the average over the ten runs to determine the best-performing method. As a measure of variability, also compute the standard deviation of each measure, for each of the 10% portions, over the ten runs.
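
    The averages and standard deviations can be computed with short helpers such as the following sketch (mean and std-deviation are illustrative names; this version uses the sample standard deviation):

    (defun mean (xs)
      "Average of the numbers in XS."
      (/ (reduce #'+ xs) (length xs)))

    (defun std-deviation (xs)
      "Sample standard deviation of the numbers in XS."
      (let ((m (mean xs)))
        (sqrt (/ (reduce #'+ (mapcar #'(lambda (x) (expt (- x m) 2)) xs))
                 (1- (length xs))))))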

  6. In the header comments of your source file, submit all of your analysis. Which was the best performing method? Which would you recommend to a client, and why? Write numeric results in the following form:
    ;;;;
    ;;;; k-NN, k = 3
    ;;;;     Overall     Poisonous   Edible     Learn Time    Test Time
    ;;;; 1   34.45+-1.5   15.6+-2.3  12.5+-0.2    0.0+-0.0     3.2+-1.2
    ;;;; ...
    ;;;; 10  89.6+-0.2    98.5+-1.1  99.6+-0.01   0.0+-0.0    45.6+-0.5
    ;;;;
    ;;;; naive Bayes
    ;;;;     Overall     Poisonous   Edible     Learn Time   Test Time
    ;;;; ...
    ;;;;
    

Instructions for Submission: In the header comments, provide the following information:

;;;;
;;;; Name
;;;; E-mail Address
;;;; Platform: Windows, Linux, Solaris (cssun)
;;;; Lisp Environment: gcl, clisp, cmucl
;;;; Mail Client: mailx, pine, GUMail, Netscape, Yahoo!, etc.
;;;;
;;;; In accordance with the class policies and Georgetown's Honor Code,
;;;; I certify that, with the exceptions of the lecture notes and those
;;;; items noted below, I have neither given nor received any assistance
;;;; on this project.
;;;;
When you are ready to submit your program for grading, e-mail it to me as one file with your net ID and the suffix ``.lisp'' as the subject line.

For example, if you were to submit using mailx on cssun, and if your net ID is ab123 and the name of your source file is proj1.lisp, then type at the UNIX prompt:

cssun% mailx -s "ab123.lisp" headdenw@cs < proj1.lisp
If you use some other mail client, then follow the same instructions, and send your code as an attachment. Submit your project before 5:00 P.M. on the due date.

Once you have submitted your project, keep an electronic copy of it on either cssun or gusun. If we lose your project or the e-mail system breaks, then we will need to look at the modification date and time of your file to ensure that you submitted it before it was due. If you developed your code on a Windows machine, then use a secure ftp client to transfer your file to cssun or gusun.

Finally, when storing source code on university machines, it is important to set file permissions so that others cannot read the file. To turn off read and write permissions for the group and others, type at the UNIX prompt chmod og-rw <file>, where <file> is the name of your source file.