COSC 575: Machine Learning

Project Guidelines

This research project is designed to let you concentrate on a problem of interest that is related to machine learning. It consists of three parts, two of which are graded: the prospectus, the presentation, and the paper.

When submitting these documents, hard copies are fine. If you submit electronically, the documents must be in Adobe's Portable Document Format (PDF).

Types of Studies in Machine Learning

There are three types of studies in machine learning:
  1. an application
  2. a theoretical study
  3. an experimental study
An application usually involves using existing machine learning methods to solve a new problem in some domain. A theoretical study involves extending an existing theoretical result or proving a new theoretical result. An experimental study typically involves evaluating a number of existing methods from a new perspective, replicating and meaningfully extending a published study, or designing a new learning method and demonstrating that it outperforms existing methods.

Since I am an experimentalist, I am in the best position to evaluate and help with experimental studies. Developing a new learning method might be a bit too ambitious for a first or semester-long class in machine learning, so I anticipate that most students will choose to extend an existing study or conduct an evaluation to characterize some aspect of existing methods.

Experimental Studies

Experimental studies play a critical role in machine learning. These studies involve designing, conducting, analyzing, and writing about one or more experiments that support or refute a research hypothesis.

An experiment is a systematic procedure designed to test a research hypothesis. Let us say an experiment consists of five elements: an experimental condition, a control condition, a set of independent variables, a set of dependent variables, and a set of measures for each dependent variable. The experimental procedure consists of systematically varying the independent variables and measuring the dependent variables for the experimental and control conditions to determine if the outcome of the experiment supports or refutes the research hypothesis.

As an example, assume that I hypothesize that the gain ratio is the best method for attribute selection for tree induction. To design an experiment to test this hypothesis, I select as the independent variable the data sets on which I will run a tree-induction algorithm. For example, I could select a large number of data sets from the UCI Repository.

I select as the dependent variable the performance of the induced trees on testing data. I decide to measure performance using accuracy. As a control condition, I use the same tree-induction algorithm, but use other methods for attribute selection, such as the Gini coefficient and the misclassification rate.
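
To make the experimental and control conditions concrete, the following is a minimal sketch of the attribute-selection measures named above for a single nominal attribute. It is an illustration under stated assumptions, not the implementation used by any particular tree-induction package. All function names and the toy data are hypothetical. The sketch treats each control measure as the reduction in impurity produced by a split, and it computes the gain ratio as information gain divided by the entropy of the attribute's value distribution.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels (or attribute values)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def misclassification(labels):
    """Error rate of always predicting the most frequent class."""
    return 1.0 - max(Counter(labels).values()) / len(labels)


def partition(values, labels):
    """Group the class labels by the value of a nominal attribute."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())


def impurity_reduction(values, labels, impurity):
    """Impurity before the split minus the weighted impurity after it.

    With impurity=entropy, this quantity is the information gain.
    """
    n = len(labels)
    after = sum(len(g) / n * impurity(g) for g in partition(values, labels))
    return impurity(labels) - after


def gain_ratio(values, labels):
    """Information gain divided by the split information of the attribute."""
    split_info = entropy(values)
    if split_info == 0.0:  # attribute takes a single value, so it cannot split
        return 0.0
    return impurity_reduction(values, labels, entropy) / split_info


# Hypothetical toy data: one nominal attribute and a binary class label.
attribute = ["a", "a", "b", "b"]
classes = ["+", "+", "-", "-"]
print("gain ratio:", gain_ratio(attribute, classes))
print("Gini reduction:", impurity_reduction(attribute, classes, gini))
print("error reduction:", impurity_reduction(attribute, classes, misclassification))
```

In an actual study, these measures would be plugged into the same tree-induction algorithm so that the attribute-selection method is the only thing that differs between conditions.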

To conduct this experiment, my experimental method consists of using ten-fold cross-validation to evaluate the decision-tree algorithms over all of the data sets. If the accuracy for all runs of the algorithm with the gain ratio is higher than that for all of the runs of the algorithms with other attribute selection methods, then this outcome supports my hypothesis. Otherwise, the outcome refutes my hypothesis.
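
The ten-fold cross-validation procedure can be sketched as follows. This is a minimal, unstratified version that uses only the Python standard library. The names k_fold_indices, cross_validate, fit_majority, and predict_majority are hypothetical, and the majority-class learner is only a stand-in for a real tree-induction algorithm. The sketch assumes there are at least as many examples as folds.

```python
import random
from collections import Counter


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def cross_validate(fit, predict, X, y, k=10, seed=0):
    """Return a list with one test-set accuracy per fold."""
    accuracies = []
    for train, test in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in test)
        accuracies.append(correct / len(test))
    return accuracies


def fit_majority(X, y):
    """Stand-in learner: remember the most frequent training label."""
    return Counter(y).most_common(1)[0][0]


def predict_majority(model, x):
    """Stand-in predictor: always return the remembered label."""
    return model


# Hypothetical data: 50 examples, 40 of class "a" and 10 of class "b".
X = list(range(50))
y = ["a"] * 40 + ["b"] * 10
print(cross_validate(fit_majority, predict_majority, X, y))
```

Because the folds depend only on the seed, calling cross_validate with the same seed for every algorithm gives each algorithm identical training and testing partitions, which makes the per-fold accuracies directly comparable.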

It is unlikely that all runs will be superior. What do we conclude if 60% of the runs support our hypothesis? What about 50%? For any percentage, are the observed differences statistically significant?
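
One way to begin answering these questions is an exact sign test on the number of wins and losses. The sketch below is one possible analysis, not a required one, and the function name sign_test_p_value is hypothetical. It assumes that the paired comparisons are independent (for example, one comparison per data set rather than one per fold) and that ties have been dropped.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """Two-sided exact sign test for paired comparisons.

    Under the null hypothesis, each comparison is equally likely to be a win
    or a loss, so the number of wins follows a Binomial(n, 0.5) distribution.
    """
    n = wins + losses
    k = max(wins, losses)
    # Probability of an outcome at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


print(sign_test_p_value(60, 40))  # 60% of 100 comparisons favor the hypothesis
print(sign_test_p_value(50, 50))  # an even split
```

With 100 comparisons, a 60% win rate gives a two-sided p-value of roughly 0.057, which falls just short of significance at the conventional 0.05 level, and an even split gives a p-value of 1.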

Some experiments include a randomized control. For example, we might implement a decision tree algorithm that randomly selects attributes at each level.

Some experiments examine confounding variables. For example, does one measure affect the pruning algorithm in a way that another measure does not? How do we measure or control for this confounding variable?

Assignment

Identify a research hypothesis. Conduct a literature search to identify at least three high-quality, peer-reviewed articles that are directly relevant to your hypothesis and study. Design an experiment to support or refute the research hypothesis. Conduct the experiment. Feel free to use WEKA or RapidMiner to conduct your study. Use real or synthetic data sets as appropriate. Analyze the results, and present them in an appropriate form, such as in graphs or tables. All measures should be accompanied by some measure of variability. You do not have to conduct a statistical hypothesis test.
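
As an illustration of reporting a measure together with its variability, the following minimal sketch summarizes a list of per-fold accuracies by their mean, sample standard deviation, and standard error. The function name summarize is hypothetical, and the accuracies are made-up numbers for illustration only, not results from any study.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(scores):
    """Return the mean, sample standard deviation, and standard error."""
    m = mean(scores)
    s = stdev(scores)  # sample standard deviation (n - 1 in the denominator)
    se = s / sqrt(len(scores))
    return m, s, se


# Hypothetical per-fold accuracies, invented for illustration only.
fold_accuracies = [0.80, 0.85, 0.75, 0.90, 0.80, 0.85, 0.80, 0.75, 0.85, 0.85]
m, s, se = summarize(fold_accuracies)
print(f"accuracy: {m:.3f} +/- {s:.3f} (std. dev.), "
      f"std. error {se:.3f}, n = {len(fold_accuracies)}")
```

Whichever measure of variability you report, state which one it is and how many runs it summarizes, since a standard deviation and a standard error can differ considerably.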

Submit a report that details your research hypothesis, experimental design, analysis, and conclusions. The report must include a bibliography with complete entries. The report must be in PDF. It must not exceed five pages, and it must not have margins less than one inch or a font size smaller than ten points. Along with the report, submit any code that you write, but you do not have to submit data sets, unless they are of your design. Upload everything as a zip file to Blackboard. If it is late, I'll deduct 1% for every minute.

To assess your project, I will consider three aspects of your study. I will consider the degree to which someone skilled in the art of machine learning could reproduce your study and its results based on the written report. I will consider the soundness of the experimental methodology that you used to conduct the study. I will evaluate the completeness of the experimental study that you performed. For example, a study consisting of two algorithms and one data set probably will not pass muster.

Ideas

A list of ideas, projects, or starting points appears below. Most if not all of the ideas below have been published, but I am more interested in having you conduct a solid experimental study than worrying about doing something novel, significant, and publishable. After all, chance favors the prepared mind.