Due: November 15th at 11pm
In this assignment you will compare several feature selection methods on several datasets.

The first dataset is the [[https://archive.ics.uci.edu/ml/datasets/Arcene| Arcene]] dataset, which was used in the 2003 NIPS feature selection competition. The dataset was produced by mass spectrometry of biological samples that come from different types of cancer.

The second dataset describes the expression of human genes in two types of leukemia. The original publication that describes the data:

T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander.
[[https://www.broadinstitute.org/mpr/publications/projects/Leukemia/Golub_et_al_1999.pdf | Molecular classification of cancer: class discovery and class prediction by gene expression monitoring]].
Science, 286(5439):531, 1999.
===== Part 1: Filter methods =====
In order for your function to work with the scikit-learn filter framework it needs to have two parameters: ''golub(X, y)'', where ''X'' is the feature matrix and ''y'' is a vector of labels. All scikit-learn filter methods return two values: a vector of scores and a vector of p-values. For our purposes we won't use p-values associated with the Golub scores, so just return the computed vector of scores twice: if your vector of scores is stored in an array called ''scores'', have the return statement be:
''return scores, scores''
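A minimal sketch of such a function, assuming the Golub signal-to-noise criterion $|\mu_j^+ - \mu_j^-| / (\sigma_j^+ + \sigma_j^-)$ for feature $j$ as discussed in class (the small epsilon guarding against zero-variance features is an addition):

<code python>
import numpy as np

def golub(X, y):
    """Golub signal-to-noise score for each column of X.

    X: (n_samples, n_features) feature matrix
    y: (n_samples,) label vector with two distinct values
    """
    pos, neg = np.unique(y)
    X_pos, X_neg = X[y == pos], X[y == neg]
    mu_pos, mu_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    sd_pos, sd_neg = X_pos.std(axis=0), X_neg.std(axis=0)
    # epsilon avoids division by zero for constant features
    scores = np.abs(mu_pos - mu_neg) / (sd_pos + sd_neg + 1e-12)
    # scikit-learn expects (scores, p-values); we have no p-values,
    # so the score vector is returned twice
    return scores, scores
</code>

The function can then be plugged into the scikit-learn filter framework, e.g. ''SelectKBest(golub, k=100)'' from ''sklearn.feature_selection''.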
===== Part 2: Embedded methods: L1 SVM =====
The L1-SVM is an SVM that uses the L1 norm as the regularization term, replacing $w^Tw$ with $\sum_{i=1}^d |w_i|$. As discussed in class, the L1-SVM leads to very sparse solutions, and can therefore be used to perform feature selection.

Run the L1-SVM on the datasets mentioned above.
In scikit-learn, use ''LinearSVC(penalty='l1', dual=False)'' to create one.
How many features have non-zero weight vector coefficients? (Note that you can obtain the weight vector of a trained SVM by looking at its ''coef_'' attribute.)
Compare the accuracy of an L1 SVM to an SVM that uses RFE to select relevant features.

Compare the accuracy, computed using 5-fold cross-validation, of a regular L2 SVM trained on the selected features with an L2 SVM trained on all the features.
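A sketch of this comparison; ''make_classification'' is a stand-in for loading the Arcene or leukemia data, and the default $C=1$ is an assumption. Note that selecting features on the full data before cross-validating introduces selection bias; for the assignment, selection should really happen inside each fold (e.g. with a ''Pipeline''):

<code python>
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# stand-in for the Arcene / leukemia feature matrices
X, y = make_classification(n_samples=100, n_features=300,
                           n_informative=10, random_state=0)

# L1-SVM: most weight vector coefficients are exactly zero
l1_svm = LinearSVC(penalty='l1', dual=False).fit(X, y)
mask = (l1_svm.coef_ != 0).ravel()
print("features with non-zero weight:", mask.sum())

# L2 SVM on the L1-selected features vs. on all features (5-fold CV)
acc_selected = cross_val_score(LinearSVC(), X[:, mask], y, cv=5).mean()
acc_all = cross_val_score(LinearSVC(), X, y, cv=5).mean()

# RFE down to the same number of features, for comparison
rfe = RFE(LinearSVC(), n_features_to_select=int(mask.sum()),
          step=0.1).fit(X, y)
acc_rfe = cross_val_score(LinearSVC(), X[:, rfe.support_], y, cv=5).mean()
print(acc_selected, acc_all, acc_rfe)
</code>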
It has been argued in the literature that the L1-SVM often leads to solutions that are too sparse. As a workaround, implement the following strategy:

  * Create $k$ sub-samples of the data in which you randomly choose 80% of the examples.
  * For each sub-sample, train an L1-SVM.
  * For each feature, compute a score that is the average of the absolute values of its weight vector coefficients across the sub-samples, and select the features with the highest scores.
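The steps above can be sketched as follows (the function name, the default $k=20$, and the random seed are assumptions made for illustration):

<code python>
import numpy as np
from sklearn.svm import LinearSVC

def subsample_l1_scores(X, y, k=20, frac=0.8, seed=0):
    """Average absolute L1-SVM weight of each feature over k sub-samples."""
    rng = np.random.RandomState(seed)
    n_samples, n_features = X.shape
    scores = np.zeros(n_features)
    for _ in range(k):
        # randomly choose 80% of the examples, without replacement
        idx = rng.choice(n_samples, size=int(frac * n_samples),
                         replace=False)
        svm = LinearSVC(penalty='l1', dual=False).fit(X[idx], y[idx])
        scores += np.abs(svm.coef_).ravel()
    return scores / k
</code>

Features can then be ranked by score, and the top-ranked ones fed to an L2 SVM.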
Do your results change if you do model selection for the resulting classifier over a grid of values for the soft margin constant $C$?
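One way to run this model selection, sketched with ''GridSearchCV'' (the particular logarithmic grid for $C$ and the stand-in data are assumptions):

<code python>
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# stand-in data; replace with the actual feature matrices
X, y = make_classification(n_samples=100, n_features=50, random_state=0)

# 5-fold cross-validated grid search over the soft margin constant C
grid = {'C': np.logspace(-3, 3, 7)}
search = GridSearchCV(LinearSVC(penalty='l1', dual=False), grid, cv=5)
search.fit(X, y)
print(search.best_params_['C'], search.best_score_)
</code>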
===== Submission =====