COSC 388: Machine Learning

Project 1
Fall 2003

Due: Sep 15 @ 5 P.M.
4 points

  1. Design a file format for training and testing examples. The design should include information about attributes and classes and their domains. Users should be able to represent discrete (e.g., numCPU has four values), symbolic (e.g., color = {red, blue, green}), and continuous (e.g., weight) attributes and class labels. You may have users store this information in as few or as many files as deemed necessary, as long as they have standard and appropriate extensions, such as .names, .atts, .train, .egs, .data, .dta, .test, .tst, and the like. Do not assume that all attribute values will be present in the examples. Most implementations require either two or three files.

  2. Design an API for reading, checking, accessing, and manipulating examples, class labels, attributes, their values, and their domains.

  3. Implement the design using the language of your choosing (C, C++, Java, or Fortran 77); however, it must compile and run on gusun, cssun, or daruma. If using C, C++, or Fortran, use make. Makefile.c and Makefile.c++ are examples, but at some point before the due date, I'll conduct a tutorial on the mysterious world of makefiles. The implementation must be general, meaning that it should work for all possible data sets. Do not use static data structures.

  4. By the way, if I haven't already said it in class, start copying and sorting your e-mail into two folders: spam and nonspam. Project 5 will involve learning to detect spam, and it's tricky for me to provide you with e-mail.

  5. Back to P1: use the following data to test your program. Regardless of the number of files the implementation requires, the executable should read any needed parameters, such as the filestem, from the command line. Any one of the following invocation styles is acceptable:
    % readdata -f bikes
    % readdata -tr bikes.train -te bikes.test
    % readdata -t bikes.train -T bikes.test
    

    Make         Tires   Handle Bars  Water Bottles  Weight  Bike Type
    Trek         Knobby  Straight     1              250.3   Mountain
    Bridgestone  Treads  Straight     2              200.1   Hybrid
    Cannondale   Knobby  Curved       0              222.9   Mountain
    Nishiki      Treads  Curved       1              190.3   Hybrid
    Trek         Treads  Straight     2              196.8   Hybrid

    All of these values are correct, but your implementation should perform light checks for data integrity.
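
As a rough illustration of how items 1, 2, and 5 might fit together, here is a minimal Java sketch. Everything in it is an assumption made for the example rather than a required design: the one-line declaration syntax (name, type, then a comma-separated list of values), the class name AttributeSketch, and the methods parse, isValid, and domainSize are all hypothetical. The assumed syntax also does not allow whitespace inside names, so "Handle Bars" would have to be written as HandleBars. The sketch stores each domain in a list that grows as needed and performs a light integrity check on a single value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: one attribute and its domain, read from a
// declaration line such as "Tires symbolic Knobby,Treads".
public class AttributeSketch {
    public enum Type { DISCRETE, SYMBOLIC, CONTINUOUS }

    private final String name;
    private final Type type;
    // The domain grows as needed; no fixed-size arrays.
    private final List<String> domain = new ArrayList<>();

    public AttributeSketch(String name, Type type, List<String> values) {
        this.name = name;
        this.type = type;
        this.domain.addAll(values);
    }

    // Parses "name type [v1,v2,...]" (an assumed format, not a standard one).
    public static AttributeSketch parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length < 2) {
            throw new IllegalArgumentException("Bad declaration: " + line);
        }
        // valueOf throws IllegalArgumentException for an unknown type name.
        Type type = Type.valueOf(parts[1].toUpperCase(Locale.ROOT));
        List<String> values = new ArrayList<>();
        if (type != Type.CONTINUOUS) {
            if (parts.length < 3) {
                throw new IllegalArgumentException("Missing values: " + line);
            }
            values.addAll(Arrays.asList(parts[2].split(",")));
        }
        return new AttributeSketch(parts[0], type, values);
    }

    // Light integrity check: continuous values must parse as numbers;
    // discrete and symbolic values must appear in the declared domain.
    public boolean isValid(String value) {
        if (value == null) {
            return false;
        }
        if (type == Type.CONTINUOUS) {
            try {
                Double.parseDouble(value);
                return true;
            } catch (NumberFormatException e) {
                return false;
            }
        }
        return domain.contains(value);
    }

    public String getName() { return name; }

    public Type getType() { return type; }

    public int domainSize() { return domain.size(); }

    public static void main(String[] args) {
        AttributeSketch tires = parse("Tires symbolic Knobby,Treads");
        AttributeSketch weight = parse("Weight continuous");
        System.out.println(tires.isValid("Knobby"));  // true
        System.out.println(tires.isValid("Slick"));   // false
        System.out.println(weight.isValid("250.3"));  // true
        System.out.println(weight.isValid("heavy"));  // false
    }
}
```

Because the domain is declared separately from the examples, a value such as Slick can be rejected even though the training data never has to contain every legal value.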

Instructions for Submission: In the header comments, provide the following information:
//
// Name
// E-mail Address
// Platform: Windows, OS X, Redhat, Solaris (cssun/gusun/daruma), etc.
// Development Environment: gcc, g++, java, g77, etc.
// Mail Client: mailx, pine, GUMail, Netscape, Yahoo!, etc.
//
When you are ready to submit your program for grading, create a compressed archive of a directory containing your project and send it to me by e-mail as an attachment, as described below.

As an example, assume your net ID is ab123 and your present working directory is $HOME/cosc388, which contains the directory p1. This directory, in turn, contains the files for your project.

To create such an archive, begin by creating a new directory with a name following the format <netid>.p<project#>. You must use this naming convention for all projects this semester. To create the directory ab123.p1, type at the UNIX prompt (%):

% mkdir ab123.p1
Copy the project files into this directory:
% cp p1/* ab123.p1
Archive this directory:
% tar -cf ab123.tar ab123.p1
This will create the file ab123.tar, which will contain the directory ab123.p1 and its contents. You can look at its contents by typing
% tar -tf ab123.tar
Compress the archive by typing
% gzip ab123.tar
which creates the file ab123.tar.gz.

Attach this file to an e-mail with the subject "ab123.tar.gz" (no quotes).

Submit your project before 5:00 P.M. on the due date.

After you submit, keep an electronic copy of your project on either cssun or gusun. These systems are regularly backed up, and if we lose your project or the e-mail system breaks, we will need to look at the modification date and time of your project to confirm that you submitted it before it was due. If you developed your code on a Windows machine, use a secure ftp client to transfer your files or the archive to cssun or gusun.

Finally, when storing source code on university machines, it is important to set file permissions so that others cannot read or modify your files. To remove read and write permissions for your group and for all other users, type the following at the UNIX prompt, where <file> is the name of your source file:
% chmod og-rw <file>