====== Assignment 5: Feature selection ======

Due: November 15th at 11pm
===== Data =====

In this assignment you will compare several feature selection methods on several datasets.

The first dataset is the [[https://archive.ics.uci.edu/ml/datasets/Arcene|Arcene]] dataset, which was used in the 2003 NIPS feature selection competition. The data were produced by mass spectrometry of biological samples that come from different types of cancer.

The second dataset describes the expression of human genes in two types of leukemia. The original publication that describes the data:

T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander.
[[https://www.broadinstitute.org/mpr/publications/projects/Leukemia/Golub_et_al_1999.pdf|Molecular classification of cancer: class discovery and class prediction by gene expression monitoring]].
Science, 286(5439):531, 1999.

Download a processed version of the dataset in libsvm format from the [[https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html|libsvm data repository]]. Look for the dataset named "leukemia". There are two files: one contains a training set and the other a test set. Merge the two files into a single file for your experiments.
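For example, the merged dataset can be loaded along the following lines (a sketch; it assumes the two files have been downloaded and decompressed under the names ''leu'' and ''leu.t'' used in the repository):

<code python>
import numpy as np
import scipy.sparse
from sklearn.datasets import load_svmlight_files

# Loading both files together keeps the feature dimensions consistent.
X_train, y_train, X_test, y_test = load_svmlight_files(["leu", "leu.t"])

# Merge the training and test portions into a single dataset.
X = scipy.sparse.vstack([X_train, X_test])
y = np.concatenate([y_train, y_test])
</code>
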
===== Part 1: Filter methods =====

Write a function that computes the Golub score of each feature. For feature $i$ the score is defined as

$s_i = \frac{|\mu_i^{(+)} - \mu_i^{(-)}|}{\sigma_i^{(+)} + \sigma_i^{(-)}}$,
where $\mu_i^{(+)}$ is the average of feature $i$ in the positive examples, $\sigma_i^{(+)}$ is the standard deviation of feature $i$ in the positive examples, and $\mu_i^{(-)}, \sigma_i^{(-)}$ are defined analogously for the negative examples.
In order for your function to work with the scikit-learn filter framework it needs to have two parameters: ''golub(X, y)'', where ''X'' is the feature matrix and ''y'' is a vector of labels. All scikit-learn filter methods return two values: a vector of scores and a vector of p-values. For our purposes we won't use p-values associated with the Golub scores, so just return the computed vector of scores twice (''return scores, scores'' if your vector of scores is stored in an array called ''scores'').
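A minimal sketch of such a function, assuming the labels are coded as +1/-1 (as in the libsvm files) and densifying sparse input so that standard deviations can be computed (''X'' and ''y'' are the merged data from the loading sketch above):

<code python>
import numpy as np
import scipy.sparse
from sklearn.feature_selection import SelectKBest

def golub(X, y):
    """Golub score for each feature; assumes binary +1/-1 labels."""
    if scipy.sparse.issparse(X):
        X = X.toarray()
    pos, neg = X[y == 1], X[y == -1]
    # The small constant guards against division by zero for constant features.
    scores = np.abs(pos.mean(axis=0) - neg.mean(axis=0)) / \
             (pos.std(axis=0) + neg.std(axis=0) + 1e-12)
    # No p-values are associated with this score, so return it twice.
    return scores, scores

# Usage: keep the 100 highest-scoring features.
selector = SelectKBest(score_func=golub, k=100)
X_filtered = selector.fit_transform(X, y)
</code>
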
===== Part 2: Embedded methods: L1 SVM =====
Compare the accuracy of an L1 SVM to an SVM that uses RFE to select relevant features.
Compare the accuracy of a regular L2 SVM trained on the features selected by the L1 SVM with the accuracy of an L2 SVM trained on all the features (compute accuracy using 5-fold cross-validation).
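One possible way to set up this comparison (a sketch; the choice of 100 features for RFE and the default values of ''C'' are placeholders, and ''X'', ''y'' are the merged data from above):

<code python>
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel, RFE
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# L1 SVM as an embedded selector feeding a standard L2 SVM.
l1_pipe = Pipeline([
    ("select", SelectFromModel(LinearSVC(penalty="l1", dual=False))),
    ("clf", LinearSVC(penalty="l2")),
])

# RFE with an L2 SVM, dropping 10% of the features per iteration.
rfe_pipe = Pipeline([
    ("select", RFE(LinearSVC(penalty="l2"), n_features_to_select=100, step=0.1)),
    ("clf", LinearSVC(penalty="l2")),
])

# Baseline: an L2 SVM trained on all the features.
all_features = LinearSVC(penalty="l2")

for name, model in [("L1", l1_pipe), ("RFE", rfe_pipe), ("all", all_features)]:
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(name, acc)
</code>
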
It has been argued in the literature that L1-SVMs often lead to solutions that are too sparse. As a workaround, implement the following strategy (a sketch follows the list):
  * Create $k$ sub-samples of the data in which you randomly choose 80% of the examples.
  * For each sub-sample train an L1-SVM.
  * For each feature compute a score that is the number of sub-samples in which that feature received a non-zero weight.
+ | |||
+ | |||
===== Part 3: Method comparison =====

Compute the accuracy of a linear L2 SVM as a function of the number of selected features on the leukemia and Arcene datasets for the following feature selection methods:

  * The Golub score
  * L1-SVM feature selection using sub-samples
  * RFE-SVM

Make sure that your evaluation provides an unbiased estimate of classifier performance.
Comment on the results.
For the above experiment you do not need to select the optimal value of the SVM soft-margin constant.
Compare these results to results obtained using internal cross-validation for selecting the soft-margin constant $C$ over a grid of values.
In writing your code, use scikit-learn's ability to combine analysis steps using the [[http://scikit-learn.org/stable/modules/pipeline.html|Pipeline class]]. This will be particularly useful for performing model selection.
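For example, a sketch of an unbiased evaluation with internal model selection for $C$ (it reuses the ''golub'' function from Part 1; the number of features and the grid of ''C'' values are placeholders):

<code python>
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, cross_val_score

# Feature selection sits inside the pipeline, so it is re-fit on the
# training portion of each fold -- this keeps the estimate unbiased.
pipe = Pipeline([
    ("select", SelectKBest(score_func=golub, k=100)),
    ("clf", LinearSVC(penalty="l2")),
])

# Internal cross-validation selects C using the training folds only.
grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10, 100]}, cv=5)

# The outer cross-validation provides the performance estimate.
print(cross_val_score(grid, X, y, cv=5).mean())
</code>
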
Grading sheet for assignment 5
Part 1: 15 points.
(15 points): Correct implementation of the Golub score
Part 2: 35 points.
(15 points): Comparison of L1-chosen features with use of all features.
(20 points): Correct implementation of L1-SVM feature selection using sub-samples.
Part 3: 40 points.
(25 points): Accuracy as a function of the number of features and discussion of the results
(15 points): Same, with model selection
Report structure, grammar and spelling: 10 points