Project 4
Fall 2002
Due: Nov 19 @ 5 P.M.
13 points
In class, we have discussed batch learning algorithms, which are appropriate for when we can gather all data relevant for training. However, there are many applications of machine learning in which we cannot collect all relevant data: intrusion detection, e-mail importance ranking, and intelligent user interfaces.
For these domains, we must use on-line learning algorithms, which learn from examples distributed over time and perform whenever is necessary. If an on-line algorithm modifies its existing concept descriptions, then it is incremental. k-nearest neighbors (k-NN) is inherently incremental. When a new example arrives, the incremental learning element simply adds the example to the store. When an unknown instance arrives, the performance element takes the new example into account by computing its distance from the unknown.
In his Ph.D. dissertation, Jeff Schlimmer studied incremental learning and introduced the Mushroom Data Set, derived from The Audubon Society Field Guide to North American Mushrooms. There are 5644 examples of poisonous and edible mushrooms characterized by 22 symbolic (or nominal) attributes. The goal is to learn to predict edible and poisonous mushrooms. The file mushroom.data.lisp contains the data set in Lisp format. Each list is an example (i.e., a mushroom). The first element is the class label: p = poisonous, e = edible. The rest of the list contains the 22 attribute values. The file mushroom.info describes these attributes and their domains.
In this project, you are to implement incremental versions of k-NN and naive Bayes and conduct an evaluation of these two methods using the Mushroom data set.
Tasks:
You can time the execution of a Lisp function using the time function. For example, if you have a function called classify and the list of testing examples bound to the atom testing, then to time the classify function you simply type: (time (classify testing)).
;;;; ;;;; k-NN, k = 3 ;;;; Overall Poisonous Edible Learn Time Test Time ;;;; 1 34.45+-1.5 15.6+-2.3 12.5+-0.2 0.0+-0.0 3.2+-1.2 ;;;; ... ;;;; 10 89.6+-0.2 98.5+-1.1 99.6+-0.01 0.0+-0.0 45.6+-0.5 ;;;; ;;;; naive Bayes ;;;; Overall Poisonous Edible Learn Time Test Time ;;;; ... ;;;;
Instructions for Submission: In the header comments, provide the following information:
;;;; ;;;; Name ;;;; E-mail Address ;;;; Platform: Windows, Linux, Solaris (cssun) ;;;; Lisp Environment: gcl, clisp, cmucl ;;;; Mail Client: mailx, pine, GUMail, Netscape, Yahoo!, etc. ;;;; ;;;; In accordance with the class policies and Georgetown's Honor Code, ;;;; I certify that, with the exceptions of the lecture notes and those ;;;; items noted below, I have neither given nor received any assistance ;;;; on this project. ;;;;When you are ready to submit your program for grading, e-mail it to me as one file with your net ID and the suffix ``.lisp'' as the subject line.
For example, if you were to submit using mailx on cssun, and if your net ID is ab123 and the name of your source file is proj1.lisp, then type at the UNIX prompt:
cssun% mailx -s "ab123.lisp" headdenw@cs < proj1.lispIf you use some other mail client, then follow the same instructions, and send your code as an attachment. Submit your project before 5:00 P.M. on the due date.
Once submitted, it is important to keep an electronic copy of your project on either cssun or gusun. If we lose your project or the e-mail system breaks, then we will need to look at the modification date and time of your project to ensure that you submitted it before it was due. If you developed your code on a Windows machine, then use a secure ftp client to transfer your file to cssun or gusun.
Finally, when storing source code on university machines, it is important to set file permissions so others cannot read the file. To turn off such read/write permissions, type at the UNIX prompt chmod og-rw <file>, where <file> is the name of your source file.