====== Assignment 5: Feature selection ======

Due: November 15th at 11pm
===== Data =====

In this assignment you will compare several feature selection methods on several datasets.

The first dataset is the [[https://archive.ics.uci.edu/ml/datasets/Arcene|Arcene]] dataset, which was used in the 2003 NIPS feature selection competition. The data were produced by mass spectrometry of biological samples that come from different types of cancer.

The second dataset describes the expression of human genes in two types of leukemia. The original publication that describes the data:

T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander.
[[https://www.broadinstitute.org/mpr/publications/projects/Leukemia/Golub_et_al_1999.pdf|Molecular classification of cancer: class discovery and class prediction by gene expression monitoring]].
Science, 286(5439):531, 1999.

Download a processed version of the dataset in libsvm format from the [[https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html|libsvm data repository]]. Look for the dataset named "leukemia". There are two files: one contains a training set and the other a test set. Merge the two files into a single file for your experiments.
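For example, the merged dataset can be loaded along the following lines (a sketch; it assumes the two files have been downloaded and decompressed under the names ''leu'' and ''leu.t'' used in the repository):

<code python>
import numpy as np
import scipy.sparse
from sklearn.datasets import load_svmlight_files

# Loading both files together keeps the feature dimensions consistent.
X_train, y_train, X_test, y_test = load_svmlight_files(["leu", "leu.t"])

# Merge the training and test portions into a single dataset.
X = scipy.sparse.vstack([X_train, X_test])
y = np.concatenate([y_train, y_test])
</code>
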
===== Part 1: Filter methods =====

Write a function that computes the Golub score of each feature. For feature $i$ the score is defined as

$s_i = \frac{|\mu_i^{(+)} - \mu_i^{(-)}|}{\sigma_i^{(+)} + \sigma_i^{(-)}}$,
where $\mu_i^{(+)}$ is the average of feature $i$ in the positive examples, $\sigma_i^{(+)}$ is the standard deviation of feature $i$ in the positive examples, and $\mu_i^{(-)}, \sigma_i^{(-)}$ are defined analogously for the negative examples.
In order for your function to work with the scikit-learn filter framework it needs to have two parameters: ''golub(X, y)'', where ''X'' is the feature matrix and ''y'' is a vector of labels. All scikit-learn filter methods return two values: a vector of scores and a vector of p-values. For our purposes we won't use p-values associated with the Golub scores, so just return the computed vector of scores twice (''return scores, scores'' if your vector of scores is stored in an array called ''scores'').
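A minimal sketch of such a function, assuming the labels are coded as +1/-1 (as in the libsvm files) and densifying sparse input so that standard deviations can be computed (''X'' and ''y'' are the merged data from the loading sketch above):

<code python>
import numpy as np
import scipy.sparse
from sklearn.feature_selection import SelectKBest

def golub(X, y):
    """Golub score for each feature; assumes binary +1/-1 labels."""
    if scipy.sparse.issparse(X):
        X = X.toarray()
    pos, neg = X[y == 1], X[y == -1]
    # The small constant guards against division by zero for constant features.
    scores = np.abs(pos.mean(axis=0) - neg.mean(axis=0)) / \
             (pos.std(axis=0) + neg.std(axis=0) + 1e-12)
    # No p-values are associated with this score, so return it twice.
    return scores, scores

# Usage: keep the 100 highest-scoring features.
selector = SelectKBest(score_func=golub, k=100)
X_filtered = selector.fit_transform(X, y)
</code>
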
===== Part 2: Embedded methods: L1 SVM =====
Compare the accuracy of an L1 SVM to an SVM that uses RFE to select relevant features.
Compare the accuracy of a regular L2 SVM trained on the features selected by the L1 SVM with the accuracy of an L2 SVM trained on all the features (compute accuracy using 5-fold cross-validation).
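One possible way to set up this comparison (a sketch; the choice of 100 features for RFE and the default values of ''C'' are placeholders, and ''X'', ''y'' are the merged data from above):

<code python>
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel, RFE
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# L1 SVM as an embedded selector feeding a standard L2 SVM.
l1_pipe = Pipeline([
    ("select", SelectFromModel(LinearSVC(penalty="l1", dual=False))),
    ("clf", LinearSVC(penalty="l2")),
])

# RFE with an L2 SVM, dropping 10% of the features per iteration.
rfe_pipe = Pipeline([
    ("select", RFE(LinearSVC(penalty="l2"), n_features_to_select=100, step=0.1)),
    ("clf", LinearSVC(penalty="l2")),
])

# Baseline: an L2 SVM trained on all the features.
all_features = LinearSVC(penalty="l2")

for name, model in [("L1", l1_pipe), ("RFE", rfe_pipe), ("all", all_features)]:
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(name, acc)
</code>
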
It has been argued in the literature that L1-SVMs often lead to solutions that are too sparse. As a workaround, implement the following strategy (a sketch follows the list):
  * Create $k$ sub-samples of the data in which you randomly choose 80% of the examples.
  * For each sub-sample train an L1-SVM.
  * For each feature compute a score that is the number of sub-samples in which that feature received a non-zero weight.
+ | |||
+ | |||
===== Part 3: Method comparison =====

Compute the accuracy of a linear L2 SVM as a function of the number of selected features on the leukemia and Arcene datasets for the following feature selection methods:

  * The Golub score
  * L1-SVM feature selection using sub-samples
  * RFE-SVM

Make sure that your evaluation provides an unbiased estimate of classifier performance.
Comment on the results.
For the above experiment you do not need to select the optimal value of the SVM soft-margin constant.
Compare these results to results obtained using internal cross-validation for selecting the soft-margin constant $C$ over a grid of values.
In writing your code, use scikit-learn's ability to combine analysis steps using the [[http://scikit-learn.org/stable/modules/pipeline.html|Pipeline class]]. This will be particularly useful for performing model selection.
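For example, a sketch of an unbiased evaluation with internal model selection for $C$ (it reuses the ''golub'' function from Part 1; the number of features and the grid of ''C'' values are placeholders):

<code python>
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, cross_val_score

# Feature selection sits inside the pipeline, so it is re-fit on the
# training portion of each fold -- this keeps the estimate unbiased.
pipe = Pipeline([
    ("select", SelectKBest(score_func=golub, k=100)),
    ("clf", LinearSVC(penalty="l2")),
])

# Internal cross-validation selects C using the training folds only.
grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10, 100]}, cv=5)

# The outer cross-validation provides the performance estimate.
print(cross_val_score(grid, X, y, cv=5).mean())
</code>
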
Grading sheet for assignment 5
Part 1: 15 points.
(15 points): Correct implementation of the Golub score
Part 2: 35 points.
(15 points): Comparison of L1-chosen features with use of all features.
(20 points): Correct implementation of L1-SVM feature selection using sub-samples.
Part 3: 40 points.
(25 points): Accuracy as a function of the number of features and discussion of the results
(15 points): Same, with model selection
Report structure, grammar and spelling: 10 points