====== Assignment 5: Feature selection ======
  
Due:  November 15th at 11pm
  
==== Data ====

In this assignment you will compare several feature selection methods on several datasets.
The first dataset is the [[https://archive.ics.uci.edu/ml/datasets/Arcene| Arcene]] dataset, which was used in the 2003 NIPS feature selection competition.  The dataset was produced by mass spectrometry of biological samples that come from different types of cancer.

The second dataset describes the expression of human genes in two types of leukemia.  The original publication that describes the data:

T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander.
[[https://www.broadinstitute.org/mpr/publications/projects/Leukemia/Golub_et_al_1999.pdf | Molecular classification of cancer: class discovery and class prediction by gene expression monitoring]].
Science, 286(5439):531, 1999.

Download a processed version of the dataset in libsvm format from the [[https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html | libsvm data repository]].  Look for the dataset named "leukemia".  There are two files, one containing a training set and the other a test set.  Merge the two files into a single file for your experiments.
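
If you work in Python, the snippet below is a minimal sketch of one way to load and merge the two files with scikit-learn; the file names ''leu'' and ''leu.t'' are an assumption and should match whatever you downloaded.

<code python>
import numpy as np
from scipy.sparse import vstack
from sklearn.datasets import load_svmlight_files

# Loading both files in one call guarantees that the training and test
# parts are parsed with a consistent number of features.
# NOTE: the file names "leu" and "leu.t" are assumptions.
X_train, y_train, X_test, y_test = load_svmlight_files(["leu", "leu.t"])

# Stack the two parts into a single dataset.
X = vstack([X_train, X_test])
y = np.concatenate([y_train, y_test])
</code>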
  
===== Part 1:  Filter methods =====
The L1-SVM is an SVM that uses the L1 norm as the regularization term by replacing $w^Tw$ with $\sum_{i=1}^d |w_i|$.  As discussed in class, the L1-SVM leads to very sparse solutions, and can therefore be used to perform feature selection.
  
Run the L1-SVM on the datasets mentioned above.
In scikit-learn use ''LinearSVC(penalty='l1', dual=False)'' to create one.
How many features have non-zero weight vector coefficients?  (Note that you can obtain the weight vector of a trained SVM by looking at its ''coef_'' attribute.)
Compare the accuracy of an L1 SVM to an SVM that uses RFE to select relevant features.
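
The following sketch shows one way to set this up, assuming ''X'' and ''y'' hold one of the merged datasets from above:

<code python>
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE

# L1-penalized linear SVM; the L1 norm drives many coefficients to zero.
l1_svm = LinearSVC(penalty='l1', dual=False)
l1_svm.fit(X, y)
n_selected = int(np.sum(l1_svm.coef_ != 0))
print("features with non-zero weight: %d" % n_selected)

# RFE with an L2 linear SVM, keeping the same number of features so that
# the two selection methods are comparable.
rfe = RFE(LinearSVC(), n_features_to_select=n_selected)
rfe.fit(X, y)
</code>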
  
Compare the accuracy of a regular L2 SVM trained on those features with an L2 SVM trained on all the features; estimate the accuracy using 5-fold cross-validation.
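
As a sketch, assuming ''X'', ''y'' and the fitted ''l1_svm'' from the previous snippet:

<code python>
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

selected = np.flatnonzero(l1_svm.coef_)  # indices of non-zero weights
acc_selected = cross_val_score(LinearSVC(), X[:, selected], y, cv=5).mean()
acc_all = cross_val_score(LinearSVC(), X, y, cv=5).mean()
print("selected: %.3f  all: %.3f" % (acc_selected, acc_all))
</code>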

It has been argued in the literature that L1-SVMs often lead to solutions that are too sparse.  As a workaround, implement the following strategy (sketched in code after the list):
  
  * Create $k$ sub-samples of the data in which you randomly choose 80% of the examples.
  * For each sub-sample train an L1-SVM.
  * For each feature compute a score that is the average weight vector coefficient over the $k$ sub-samples.
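
A sketch of this strategy, assuming ''X'' and ''y'' as before; the value of $k$ and the interpretation of the score as the average of the signed weights are illustrative assumptions:

<code python>
import numpy as np
from sklearn.svm import LinearSVC

k = 10  # an illustrative choice; experiment with other values
n = X.shape[0]
rng = np.random.RandomState(0)
weights = []
for _ in range(k):
    # Randomly choose 80% of the examples, without replacement.
    idx = rng.choice(n, size=int(0.8 * n), replace=False)
    svm = LinearSVC(penalty='l1', dual=False)
    svm.fit(X[idx], y[idx])
    weights.append(svm.coef_.ravel())

# Score each feature by its average weight over the k sub-samples and
# rank features by the magnitude of that score.
scores = np.mean(weights, axis=0)
ranking = np.argsort(-np.abs(scores))
</code>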
  
  
  
Do your results change if you do model selection for the resulting classifier over a grid of values for the soft margin constant $C$?
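
For the model selection step you can use scikit-learn's grid search.  A minimal sketch, where ''X_sel'' is assumed to be the data restricted to the selected features, and the grid of $C$ values is only an example:

<code python>
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# X_sel: data restricted to the selected features (an assumption);
# the grid of C values below is only an example.
grid = GridSearchCV(LinearSVC(),
                    param_grid={'C': [0.01, 0.1, 1, 10, 100]},
                    cv=5)
grid.fit(X_sel, y)
print(grid.best_params_, grid.best_score_)
</code>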
  
===== Submission =====