Due date: 9/17 at 3:30pm
In this assignment we will work with two datasets from the UCI repository:
Heart disease diagnosis dataset. To make it simpler for you to read the dataset, there is a processed version, which will be easier to load into PyML. Note that there are two versions of the dataset on the libsvm repository: a raw unprocessed version, and a version where the features have been normalized to be in the range [-1, 1]. This data is in sparse format, which is described in the PyML tutorial. To read this dataset into a PyML dataset, use the following commands:
In [1]: from PyML import *
In [2]: data = vectorDatasets.load_libsvm_format("heart")
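If you want to inspect the sparse libsvm format yourself, here is a minimal sketch of a parser in plain Python (independent of PyML). The example line is hypothetical; each line of the format holds a label followed by index:value pairs for the nonzero features.

```python
def parse_libsvm_line(line):
    """Parse one line of libsvm sparse format:
    '<label> <index>:<value> <index>:<value> ...'
    Returns (label, {index: value}) with only the nonzero features."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features

# A hypothetical line in the style of the heart dataset:
label, features = parse_libsvm_line("+1 1:0.708 3:1.0 13:-1.0")
# label is 1.0; features maps 1 -> 0.708, 3 -> 1.0, 13 -> -1.0
```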
The Gisette handwritten digit recognition dataset. In this case the feature data matrix is provided separately from the labels; the feature matrix is a delimited file, which the PyVectorDataSet container handles directly. So all you need to do is something like:
In [3]: data = vectorDatasets.PyVectorDataSet("gisette_train.data")
Now you will need to read the labels and attach them to the dataset you created:
In [4]: data.attachLabels(Labels("gisette_train.labels"))
When implementing the classifier you will find it useful to extract the examples that are associated with a given class. That information is stored in the Labels object associated with a dataset as data.labels.classes. This is a list of lists: each element is the list of indexes of the examples that belong to a given class. For example, data.labels.classes[0] is the list of examples in the first class.
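To make the structure concrete, here is a small sketch of how such a list of lists can be used; the index values are hypothetical, standing in for what data.labels.classes would hold for a two-class dataset with six examples.

```python
# Hypothetical contents of data.labels.classes for a two-class
# dataset with six examples:
classes = [[0, 2, 5], [1, 3, 4]]

# Indexes of the examples belonging to the first class:
first_class = classes[0]            # [0, 2, 5]

# Number of examples in each class:
sizes = [len(c) for c in classes]   # [3, 3]
```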
Implement the nearest centroid classifier in Python, using the code for the perceptron algorithm as a template. Generate a toy 2-d dataset using PyML's demo2d module (see the PyML tutorial for details) and illustrate that your classifier is working correctly. Note that since we are using the NumPy-based data container, we need to invoke the getData method as demo2d.getData(numpy_container=True).
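To clarify the decision rule you are implementing, here is a minimal NumPy sketch of nearest centroid classification, independent of PyML: training computes one centroid (mean vector) per class, and prediction assigns a point to the class whose centroid is closest in Euclidean distance. The data and class indexes below are made up for illustration; the classes argument follows the same list-of-lists format as data.labels.classes.

```python
import numpy as np

def train_centroids(X, classes):
    """Compute one centroid per class.
    X: (n_examples, n_features) array.
    classes: list of index lists, one list per class."""
    return np.array([X[idx].mean(axis=0) for idx in classes])

def predict(centroids, x):
    """Assign x to the class whose centroid is nearest (Euclidean)."""
    distances = np.linalg.norm(centroids - x, axis=1)
    return int(np.argmin(distances))

# Tiny 2-d example with two well-separated classes:
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
classes = [[0, 1], [2, 3]]
centroids = train_centroids(X, classes)
print(predict(centroids, np.array([0.2, 0.3])))   # prints 0
```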
Once you are satisfied that you have a working classifier, compare its accuracy with that of the perceptron on the two datasets referred to above.
Whenever we learn a classifier it is useful to know if we have collected a sufficient amount of data for accurate classification. A good way of determining that is to construct a learning curve, which is a plot of classifier accuracy as a function of the number of training examples. Plot a learning curve for the nearest centroid and perceptron classifiers using the Gisette dataset. The x-axis for the plot (number of training examples) should be on a logarithmic scale, e.g. 10, 20, 40, 80, 200, 400, 800. Use numbers that are appropriate for the dataset at hand, choosing values that illustrate the variation you observe. What can you conclude from the learning curve you have constructed?
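The overall shape of such an experiment can be sketched as follows. This is not the Gisette pipeline; it uses synthetic two-class Gaussian data and the nearest centroid rule so the loop structure is runnable on its own, and the make_data helper is an invented stand-in for your real training/test splits. The plotting calls are shown commented out since the computation itself does not need matplotlib.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Hypothetical two-class 2-d data: class 0 near (0,0), class 1 near (3,3)."""
    X0 = rng.normal(0.0, 1.0, size=(n // 2, 2))
    X1 = rng.normal(3.0, 1.0, size=(n - n // 2, 2))
    return np.vstack([X0, X1]), np.array([0] * (n // 2) + [1] * (n - n // 2))

def centroid_accuracy(X_train, y_train, X_test, y_test):
    """Train nearest centroid on the training split, score on the test split."""
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return float(np.mean(np.argmin(dists, axis=1) == y_test))

X_test, y_test = make_data(400)
sizes = [10, 20, 40, 80, 200, 400, 800]
accuracies = [centroid_accuracy(*make_data(n), X_test, y_test) for n in sizes]

# Plot accuracy against training-set size on a logarithmic x-axis:
# import matplotlib.pyplot as plt
# plt.semilogx(sizes, accuracies, marker="o")
# plt.xlabel("number of training examples"); plt.ylabel("accuracy")
# plt.show()
```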
In this section we will explore the effect of normalizing the data, focusing on normalization of features (e.g. standardization).
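As a reminder of what standardization means, here is a minimal NumPy sketch: each feature is shifted to zero mean and scaled to unit standard deviation, with the statistics computed on the training set only and then applied unchanged to the test set. The small arrays are made-up illustration data.

```python
import numpy as np

def standardize(X_train, X_test):
    """Z-score each feature using training-set statistics only,
    then apply the same transform to the test set."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0.0] = 1.0   # guard against constant features
    return (X_train - mean) / std, (X_test - mean) / std

X_train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
X_test = np.array([[3.0, 20.0]])
Z_train, Z_test = standardize(X_train, X_test)
# Each training column now has mean 0 and unit standard deviation.
```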
Here's what you need to do:
Your report needs to be written in LaTeX. Here are some files to help you start playing with LaTeX and writing your report. Download and extract the files from start_latex.tar. You will then have the following files:
The Makefile contains the commands required for generating a pdf file out of the LaTeX source and the other files that are required. On a Unix/Linux system that has LaTeX installed you can just run
> make
The file listings-python-options.sty is a LaTeX style file that tweaks the parameters of the listings LaTeX package so that Python code is displayed nicely.
Submit your report via RamCT. Python code can be included in your report if it is succinct (not more than a page or two at the most) or submitted separately. The LaTeX sample document shows how to display Python code in a LaTeX document. Also, please check in a text file named README that describes what you found most difficult in completing this assignment (or provide that as a comment on RamCT).
Here is what the grade sheet will look like for this assignment. A few general guidelines for this and future assignments in the course:
Grading sheet for assignment 1

Part 1: 40 points.
  (15 points): Correct implementation of nearest centroid classifier
  (15 points): Good protocol for evaluating classifier accuracy; results are provided in a clear and concise way
  (10 points): Discussion

Part 2: 25 points.
  (15 points): Learning curves are correctly generated and displayed in a clear and readable way
  (10 points): Discussion

Part 3: 20 points.
  (10 points): Comparison of normalized/raw data results; discussion of results
  ( 5 points): How to perform data scaling
  ( 5 points): Comparison of scaling with standardizing

Report structure, grammar and spelling: 15 points
  ( 5 points): Heading and subheading structure easy to follow and clearly divides report into logical sections.
  ( 5 points): Code, math, figure captions, and all other aspects of report are well-written and formatted.
  ( 5 points): Grammar, spelling, and punctuation.