Due: November 15th at 11pm
In this assignment you will compare several feature selection methods on several datasets. The first dataset is the Arcene dataset, which was used in the 2003 NIPS feature selection competition. The dataset is produced by mass spectrometry of biological samples that come from different types of cancer.
The second dataset describes the expression of human genes in two types of leukemia. The original publication that describes the data:
T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531, 1999.
Download a processed version of the dataset in libsvm format from the libsvm data repository. Look for the dataset named “leukemia”. There are two files: one contains a training set and the other a test set. Merge the two files into a single file for your experiments.
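One way to merge the two files is to load them together and stack the results. The sketch below creates two tiny stand-in files so it runs on its own; with the real data you would pass the names of the downloaded leukemia train and test files (the file names here are placeholders, not the repository's actual names).

```python
import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_files

# Toy stand-ins for the downloaded train/test files.
dump_svmlight_file(np.array([[1.0, 0.0], [0.0, 2.0]]),
                   np.array([1, -1]), "train.libsvm")
dump_svmlight_file(np.array([[3.0, 1.0]]), np.array([1]), "test.libsvm")

# Loading both files together keeps the feature indexing consistent.
X_tr, y_tr, X_te, y_te = load_svmlight_files(["train.libsvm", "test.libsvm"])
X = np.vstack([X_tr.toarray(), X_te.toarray()])
y = np.concatenate([y_tr, y_te])
print(X.shape)  # (3, 2)
```

Loading the files in one `load_svmlight_files` call matters: loading them separately can yield feature matrices with different numbers of columns if the last features differ between files.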
Implement a Python function that returns an array with the Golub score of a labeled dataset. Recall that the Golub score for feature $i$ is defined as:
$$
\frac{|\mu_i^{(+)} - \mu_i^{(-)}|}{\sigma_i^{(+)} + \sigma_i^{(-)}}
$$
where $\mu_i^{(+)}$ is the average of feature $i$ in the positive examples,
where $\sigma_i^{(+)}$ is the standard deviation of feature $i$ in the positive examples, and $\mu_i^{(-)}, \sigma_i^{(-)}$ are defined analogously for the negative examples.
In order for your function to work with the scikit-learn filter framework it needs to have the signature golub(X, y), where X is the feature matrix and y is a vector of labels. All scikit-learn filter methods return two values: a vector of scores and a vector of p-values. For our purposes we won't use p-values associated with the Golub scores, so simply return the computed vector of scores twice: if your vector of scores is stored in an array called scores, have the return statement be:
return scores, scores
The L1-SVM is an SVM that uses the L1 norm as the regularization term by replacing $w^Tw$ with $\sum_{i=1}^d |w_i|$. As discussed in class, the L1 SVM leads to very sparse solutions, and can therefore be used to perform feature selection.
Run the L1-SVM on the datasets mentioned above.
In scikit-learn use LinearSVC(penalty='l1', dual=False) to create one. How many features have non-zero weight vector coefficients? (Note that you can obtain the weight vector of a trained SVM by looking at its coef_ attribute.)
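The count of selected features can be read off the weight vector. A sketch, with synthetic data standing in for the assignment's datasets:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in: many features, few of them informative.
X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=10, random_state=0)

clf = LinearSVC(penalty='l1', dual=False, C=1.0, max_iter=10000)
clf.fit(X, y)

# L1 regularization drives many coefficients exactly to zero.
n_selected = np.sum(clf.coef_ != 0)
print(n_selected, "features have non-zero weights")
```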
Compare the accuracy of an L1 SVM to an SVM that uses RFE to select relevant features.
Compare the accuracy of a regular L2 SVM trained on those features with an L2 SVM trained on all the features computed using 5-fold cross-validation.
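A sketch of this comparison, again on synthetic data: RFE is wrapped in a pipeline so feature selection is re-run inside each cross-validation fold, and both classifiers are scored with 5-fold cross-validation. The number of features to select and the RFE step size are illustrative choices, not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=100, n_features=200,
                           n_informative=10, random_state=0)

# L2 SVM on RFE-selected features (selection happens inside each fold).
selector = RFE(LinearSVC(dual=False, max_iter=10000),
               n_features_to_select=20, step=10)
pipe = make_pipeline(selector, LinearSVC(dual=False, max_iter=10000))
acc_selected = cross_val_score(pipe, X, y, cv=5).mean()

# L2 SVM on all features.
acc_all = cross_val_score(LinearSVC(dual=False, max_iter=10000),
                          X, y, cv=5).mean()
print(acc_selected, acc_all)
```

Putting RFE inside the pipeline avoids selection bias: selecting features on the full dataset before cross-validating would leak information from the held-out folds.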
It has been argued in the literature that L1-SVMs often lead to solutions that are too sparse. As a workaround, implement the following strategy:
Do your results change if you do model selection for the resulting classifier over a grid of values for the soft margin constant $C$?
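Model selection over C can be done with GridSearchCV; the grid of values below is illustrative, not a prescription, and synthetic data again stands in for the real datasets.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=100, n_features=200,
                           n_informative=10, random_state=0)

# 5-fold cross-validated search over the soft margin constant C.
grid = GridSearchCV(LinearSVC(penalty='l1', dual=False, max_iter=10000),
                    param_grid={'C': [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print(grid.best_params_['C'], grid.best_score_)
```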
Submit the PDF of your report via Canvas. Python code can be displayed in your report if it is succinct (not more than a page or two at the most) or submitted separately. The LaTeX sample document shows how to display Python code in a LaTeX document. Code needs to be there so we can make sure that you implemented the algorithms and data analysis methodology correctly. Canvas allows you to submit multiple files for an assignment, so DO NOT submit an archive file (tar, zip, etc.). Canvas will only allow you to submit PDFs (.pdf extension) or Python code (.py extension). For this assignment there is a strict 8 page limit (not including references and code that is provided as an appendix). We will take off points for reports that go over the page limit. In addition to the code snippets that you include in your report, make sure you provide complete code from which we can see exactly how your results were generated.
A few general guidelines for this and future assignments in the course:
Grading sheet for assignment 3

Part 1: 40 points.
  (10 points): Primal SVM formulation is correct
  ( 7 points): Lagrangian found correctly
  ( 8 points): Derivation of saddle point equations
  (10 points): Derivation of the dual
  ( 5 points): Discussion of the implication of the form of the dual for SMO-like algorithms

Part 2: 10 points.

Part 3: 40 points.
  (20 points): Accuracy as a function of parameters and discussion of the results
  (15 points): Comparison of normalized and non-normalized kernels and correct model selection
  ( 5 points): Visualization of the kernel matrix and observations made about it

Report structure, grammar and spelling: 10 points
  (10 points): Heading and subheading structure easy to follow and clearly divides report into logical sections. Code, math, figure captions, and all other aspects of the report are well-written and formatted. Grammar, spelling, and punctuation. Answers are clear and to the point.