COSC 288: Introduction to Machine Learning

Project 1
Spring 2009

Due: Fri, Feb 6 @ 5 PM
4 points

  1. Write a program to read, parse, and store examples in Mark's File Format (mff). Consider the following version of the Bikes data set:

    Make         Tires   Handle Bars  Water Bottle  Weight  Bike Type
    Trek         Knobby  Straight     y             250.3   Mountain
    Bridgestone  Treads  Straight     y             200     Hybrid
    Cannondale   Knobby  Curved       n             222.9   Mountain
    Nishiki      Treads  Curved       y             190.3   Hybrid
    Trek         Treads  Straight     y             196.8   Hybrid

    The file bikes.mff shows this data set in Mark's File Format. Files containing valid data sets begin with '@dataset' followed by an identifier. Attribute declarations appear next. The string '@attribute' precedes each declaration, which is for a symbolic attribute or a numeric attribute. The attribute's name appears next, followed by its domain. The domain for symbolic attributes is a list of values separated by whitespace. The domain for numeric attributes is not explicitly specified and is assumed to be the set of representable floating-point numbers.

    The token '@examples' separates the attribute declarations from the examples, which are simply values separated by whitespace.

    For simplicity, you can assume that all elements of the file are separated by at least one space character. Moreover, attribute declarations and examples will appear on single lines.

  2. In subsequent projects, you will need to perform a number of operations on the data set. I would recommend implementing two structures: one for the information about attributes, and one for the examples themselves.
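
As a rough illustration of items 1 and 2, here is a minimal Python sketch of one way to store attribute information and examples separately and to fill both from a file's lines. It is not a reference solution. Several things in it are assumptions that the description above does not settle: the keyword `numeric` marking a numeric attribute, attribute names being single tokens, the names `Attribute`, `DataSet`, and `parse_mff`, and the `SAMPLE` text, which is a hypothetical rendering of the Bikes table and not the actual contents of bikes.mff.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    """Information about one attribute (hypothetical structure)."""
    name: str
    domain: list           # symbolic values; empty for a numeric attribute
    numeric: bool = False


@dataclass
class DataSet:
    """Attribute information plus the examples themselves."""
    name: str = ""
    attributes: list = field(default_factory=list)
    examples: list = field(default_factory=list)


def parse_mff(lines):
    """Parse an iterable of text lines into a DataSet, with light checks."""
    data = DataSet()
    in_examples = False
    for lineno, line in enumerate(lines, start=1):
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        if tokens[0] == "@dataset":
            if len(tokens) != 2:
                raise ValueError(f"line {lineno}: expected '@dataset <name>'")
            data.name = tokens[1]
        elif tokens[0] == "@attribute":
            if len(tokens) < 3:
                raise ValueError(f"line {lineno}: attribute needs a domain")
            name, domain = tokens[1], tokens[2:]
            # ASSUMPTION: the single keyword "numeric" marks a numeric attribute.
            if domain == ["numeric"]:
                data.attributes.append(Attribute(name, [], numeric=True))
            else:
                data.attributes.append(Attribute(name, domain))
        elif tokens[0] == "@examples":
            in_examples = True
        elif in_examples:
            if len(tokens) != len(data.attributes):
                raise ValueError(f"line {lineno}: wrong number of values")
            example = []
            for attr, token in zip(data.attributes, tokens):
                if attr.numeric:
                    example.append(float(token))  # ValueError if not a number
                elif token in attr.domain:
                    example.append(token)
                else:
                    raise ValueError(
                        f"line {lineno}: '{token}' not in domain of {attr.name}")
            data.examples.append(example)
        else:
            raise ValueError(f"line {lineno}: unexpected text before @examples")
    return data


# Hypothetical rendering of the Bikes table; not the real bikes.mff.
SAMPLE = """\
@dataset bikes

@attribute make trek bridgestone cannondale nishiki
@attribute tires knobby treads
@attribute handlebars straight curved
@attribute waterbottle y n
@attribute weight numeric
@attribute type mountain hybrid

@examples
trek knobby straight y 250.3 mountain
bridgestone treads straight y 200 hybrid
cannondale knobby curved n 222.9 mountain
nishiki treads curved y 190.3 hybrid
trek treads straight y 196.8 hybrid
"""
```

Calling `parse_mff(SAMPLE.splitlines())` returns a `DataSet` whose `attributes` list describes each column and whose `examples` list holds one list of values per example. Both lists grow as the file is read, so nothing about the number of attributes or examples is fixed in advance.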

  3. The program should take input from the command line. Use -t to specify the name of the training file and use -T to specify the name of the testing file, if any.
    % readdata -t bikes.mff
    % readdata -t bikes.train
    % readdata -t bikes.train -T bikes.test
    % readdata -t bikes-tr.mff -T bikes-te.mff
    
    The program should perform light checks for proper formatting and data integrity. For this project, the program can simply output the examples of the input files to the console.
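
The -t and -T options in item 3 could be handled along these lines in Python, using the standard `argparse` module. This is only a sketch: the function name `parse_args` and the destination names `train_file` and `test_file` are hypothetical, and treating -t as mandatory is an assumption drawn from the example invocations above.

```python
import argparse


def parse_args(argv):
    """Parse readdata's command-line options (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="readdata")
    # ASSUMPTION: a training file is always required.
    parser.add_argument("-t", dest="train_file", required=True,
                        metavar="FILE", help="name of the training file")
    # The testing file is optional; it stays None when -T is absent.
    parser.add_argument("-T", dest="test_file", default=None,
                        metavar="FILE", help="name of the testing file")
    return parser.parse_args(argv)
```

A program would typically call `parse_args(sys.argv[1:])` and then read `args.train_file` and, when it is not `None`, `args.test_file`. Option strings are case-sensitive in `argparse`, so -t and -T are treated as distinct options.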

  4. Implement the program using the language of your choosing (e.g., ANSI C, ANSI C++, Java, Fortran 77, Lisp, ruby, python); however, it must compile and run on my Unix machine. If using C, C++, or Fortran, use make. Makefile.c and Makefile.c++ are examples, but at some point before the due date, I'll conduct a tutorial on the mysterious world of makefiles. The implementation must be general, meaning that it should work for all possible data sets. Do not use static data structures. You must implement your program using the standard libraries the language provides. If you want to use something non-standard, check with me first.

Instructions for Submission

In the header comments in at least the main file of your project, provide the following information:
//
// Name
// E-mail Address
// Platform: Windows, OS X, Linux, Solaris (daruma), etc.
// Language/Environment: gcc, g++, java, g77, ruby, python, haskell, etc.
//
// In accordance with the class policies and Georgetown's Honor Code,
// I certify that, with the exceptions of the class resources and those
// items noted below, I have neither given nor received any assistance
// on this project.
//
Let's try something different this semester. Let's try submitting via Blackboard. When you are ready to submit your program for grading, create a compressed archive of a directory containing only your project's source, and upload it to Blackboard. The directory's name should be the same as your net ID. If you need to include a note with your submission, put the note in a README file in the directory.

For example, assume your net ID is ab123. If the directory p1 contains your project, then rename the directory to ab123.

To make the archive smaller, remove any object files, such as .class, a.out, and .o files.

Use zip, tar, or jar to create an archive:

% zip -r ab123.zip ab123/*
% tar -cf ab123.tar ab123
% jar -cf ab123.jar ab123
Use jar only for Java projects. If you use jar or tar, then compress the archive by typing
% gzip ab123.tar
% gzip ab123.jar
which creates the file ab123.tar.gz or ab123.jar.gz, respectively.

Upload the compressed archive to Blackboard.

Submit your project before 5:00 P.M. on the due date.

N.B. Blackboard changed something around. I believe you upload your assignment through the Grade Center, although I'm not entirely sure because I don't know how assignments appear in a student's view. Check soon, and let me know if I need to add something from the instructor's view.