Due date: 9/17 at 3:30pm
In this assignment we will work with two datasets from the UCI repository:
Heart disease diagnosis dataset. To make it simpler for you to read the dataset, there is a processed version, which will be easier to load into PyML. Note that there are two versions of the dataset on the libsvm repository: a raw unprocessed version, and a version where the features have been normalized to be in the range [-1, 1]. This data is in sparse format, which is described in the PyML tutorial. To read this dataset into a PyML dataset, use the following commands:
In [1]: from PyML import *
In [2]: data = vectorDatasets.load_libsvm_format("heart")
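If you want to inspect the sparse libsvm format yourself, here is a minimal sketch of a parser in plain Python (independent of PyML). The example line is hypothetical; each line of the format holds a label followed by index:value pairs for the nonzero features.

```python
def parse_libsvm_line(line):
    """Parse one line of libsvm sparse format:
    '<label> <index>:<value> <index>:<value> ...'
    Returns (label, {index: value}) with only the nonzero features."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features

# A hypothetical line in the style of the heart dataset:
label, features = parse_libsvm_line("+1 1:0.708 3:1.0 13:-1.0")
# label is 1.0; features maps 1 -> 0.708, 3 -> 1.0, 13 -> -1.0
```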
The Gisette handwritten digit recognition dataset. In this case the feature data matrix is provided separately from the labels; the feature matrix is a delimited file, which the PyVectorDataSet container handles directly. So all you need to do is something like:
In [3]: data = vectorDatasets.PyVectorDataSet("gisette_train.data")
Now you will need to read the labels and attach them to the dataset you created:
In [4]: data.attachLabels(Labels("gisette_train.labels"))
When implementing the classifier you will find it useful to extract the examples that are associated with a given class. That information is stored in the Labels object associated with a dataset as data.labels.classes. This is a list of lists: each element is the list of indexes of the examples that belong to a given class. For example, data.labels.classes[0] is the list of examples in the first class.
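To make the structure concrete, here is a small sketch of how such a list of lists can be used; the index values are hypothetical, standing in for what data.labels.classes would hold for a two-class dataset with six examples.

```python
# Hypothetical contents of data.labels.classes for a two-class
# dataset with six examples:
classes = [[0, 2, 5], [1, 3, 4]]

# Indexes of the examples belonging to the first class:
first_class = classes[0]            # [0, 2, 5]

# Number of examples in each class:
sizes = [len(c) for c in classes]   # [3, 3]
```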
Implement the nearest centroid classifier in Python, using the code for the perceptron algorithm as a template. Generate a toy 2-d dataset using PyML's demo2d module (see the PyML tutorial for details) and illustrate that your classifier is working correctly. Note that since we are using the NumPy-based data container, we need to invoke the getData method as demo2d.getData(numpy_container=True).
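To clarify the decision rule you are implementing, here is a minimal NumPy sketch of nearest centroid classification, independent of PyML: training computes one centroid (mean vector) per class, and prediction assigns a point to the class whose centroid is closest in Euclidean distance. The data and class indexes below are made up for illustration; the classes argument follows the same list-of-lists format as data.labels.classes.

```python
import numpy as np

def train_centroids(X, classes):
    """Compute one centroid per class.
    X: (n_examples, n_features) array.
    classes: list of index lists, one list per class."""
    return np.array([X[idx].mean(axis=0) for idx in classes])

def predict(centroids, x):
    """Assign x to the class whose centroid is nearest (Euclidean)."""
    distances = np.linalg.norm(centroids - x, axis=1)
    return int(np.argmin(distances))

# Tiny 2-d example with two well-separated classes:
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
classes = [[0, 1], [2, 3]]
centroids = train_centroids(X, classes)
print(predict(centroids, np.array([0.2, 0.3])))   # prints 0
```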
Once you are satisfied that you have a working classifier, compare its accuracy with that of the perceptron on the two datasets referred to above.
Whenever we learn a classifier it is useful to know if we have collected a sufficient amount of data for accurate classification. A good way of determining that is to construct a learning curve, which is a plot of classifier accuracy as a function of the number of training examples. Plot a learning curve for the nearest centroid and perceptron classifiers using the Gisette dataset. The x-axis for the plot (number of training examples) should be on a logarithmic scale, e.g. 10, 20, 40, 80, 200, 400, 800. Use numbers that are appropriate for the dataset at hand, choosing values that illustrate the variation you observe. What can you conclude from the learning curve you have constructed?
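The overall shape of such an experiment can be sketched as follows. This is not the Gisette pipeline; it uses synthetic two-class Gaussian data and the nearest centroid rule so the loop structure is runnable on its own, and the make_data helper is an invented stand-in for your real training/test splits. The plotting calls are shown commented out since the computation itself does not need matplotlib.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Hypothetical two-class 2-d data: class 0 near (0,0), class 1 near (3,3)."""
    X0 = rng.normal(0.0, 1.0, size=(n // 2, 2))
    X1 = rng.normal(3.0, 1.0, size=(n - n // 2, 2))
    return np.vstack([X0, X1]), np.array([0] * (n // 2) + [1] * (n - n // 2))

def centroid_accuracy(X_train, y_train, X_test, y_test):
    """Train nearest centroid on the training split, score on the test split."""
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return float(np.mean(np.argmin(dists, axis=1) == y_test))

X_test, y_test = make_data(400)
sizes = [10, 20, 40, 80, 200, 400, 800]
accuracies = [centroid_accuracy(*make_data(n), X_test, y_test) for n in sizes]

# Plot accuracy against training-set size on a logarithmic x-axis:
# import matplotlib.pyplot as plt
# plt.semilogx(sizes, accuracies, marker="o")
# plt.xlabel("number of training examples"); plt.ylabel("accuracy")
# plt.show()
```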
In this section we will explore the effect of normalizing the data, focusing on normalization of features (e.g. standardization).
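As a reminder of what standardization means, here is a minimal NumPy sketch: each feature is shifted to zero mean and scaled to unit standard deviation, with the statistics computed on the training set only and then applied unchanged to the test set. The small arrays are made-up illustration data.

```python
import numpy as np

def standardize(X_train, X_test):
    """Z-score each feature using training-set statistics only,
    then apply the same transform to the test set."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0.0] = 1.0   # guard against constant features
    return (X_train - mean) / std, (X_test - mean) / std

X_train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
X_test = np.array([[3.0, 20.0]])
Z_train, Z_test = standardize(X_train, X_test)
# Each training column now has mean 0 and unit standard deviation.
```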
Here's what you need to do:
Your report needs to be written in LaTeX. Here are some files to help you start playing with LaTeX and writing your report. Download and extract the files from start_latex.tar. You will then have the following files:
The Makefile contains the commands required for generating a pdf file out of the LaTeX source and the other files that are required. On a Unix/Linux system that has LaTeX installed you can just run
> make
The file listings-python-options.sty is a LaTeX style file that tweaks the parameters of the listings LaTeX package so that Python code is displayed nicely.
Submit your report via RamCT. Python code can be included in your report if it is succinct (not more than a page or two at the most) or submitted separately. The LaTeX sample document shows how to display Python code in a LaTeX document. Also, please check in a text file named README that describes what you found most difficult in completing this assignment (or provide that as a comment on RamCT).
Here is what the grade sheet will look like for this assignment. A few general guidelines for this and future assignments in the course:
Grading sheet for assignment 1

Part 1: 40 points.
  (15 points): Correct implementation of nearest centroid classifier
  (15 points): Good protocol for evaluating classifier accuracy; results are provided in a clear and concise way
  (10 points): Discussion

Part 2: 25 points.
  (15 points): Learning curves are correctly generated and displayed in a clear and readable way
  (10 points): Discussion

Part 3: 20 points.
  (10 points): Comparison of normalized/raw data results; discussion of results
  ( 5 points): How to perform data scaling
  ( 5 points): Comparison of scaling with standardizing

Report structure, grammar and spelling: 15 points
  ( 5 points): Heading and subheading structure easy to follow and clearly divides report into logical sections.
  ( 5 points): Code, math, figure captions, and all other aspects of report are well-written and formatted.
  ( 5 points): Grammar, spelling, and punctuation.