Warning: Declaration of action_plugin_tablewidth::register(&$controller) should be compatible with DokuWiki_Action_Plugin::register(Doku_Event_Handler $controller) in /s/bach/b/class/cs545/public_html/fall16/lib/plugins/tablewidth/action.php on line 93
assignments:assignment3 [CS545 fall 2016]

User Tools

Site Tools


assignments:assignment3

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
assignments:assignment3 [2016/08/09 10:25]
127.0.0.1 external edit
assignments:assignment3 [2016/09/20 09:34] (current)
asa [Part 1]
Line 1: Line 1:
-========= Assignment 3: Support Vector Machines ============+~~NOTOC~~
  
-Due:  October 16th at 11pm+====== Assignment 3 ======
  
-===== Part 1 SVM with no bias term =====+**Due:** 10/3 at 11:59pm.
  
-Formulate a soft-margin SVM without the bias term, i.e. one where the discriminant function is equal to $\mathbf{w}^{T} \mathbf{x}$. +===== Preliminaries =====
-Derive the saddle point conditions, KKT conditions and the dual. +
-Compare it to the standard SVM formulation that was derived in class. +
-In class we discussed SMO-type algorithms for optimizing the dual SVM.  At each step SMO optimizes two variables at a time, which is the smallest number possible. +
-Is this still the case for the formulation you have derived? ​ In other words, is two the smallest number of variables that can be optimized at a time? +
-Hint:  consider the difference in the constraints.+
  
-===== Part 2 Soft-margin SVM for separable ​data =====+In this assignment you will explore ridge regression applied to the task of predicting wine quality. 
 +You will use the [[http://​archive.ics.uci.edu/​ml/​datasets/​Wine+Quality | wine quality]] dataset from the UCI machine learning repository, and compare accuracy obtained using ridge regression to the results from a [[http://​www.sciencedirect.com/​science/​article/​pii/​S0167923609001377#​ | recent publication]] (if you have trouble accessing that version of the paper, here's a link to a [[http://​www3.dsi.uminho.pt/​pcortez/​wine5.pdf| preprint]]. 
 +The wine data is composed of two datasets ​one for white wines, and one for reds.  In this assignment perform all your analyses on just the red wine data.
  
-Consider training a soft-margin SVM  +The features for the wine dataset are not standardized,​ so make sure you do this, especially since we are going to consider ​the magnitude ​of the weight vector ​(recall that standardization entails subtracting ​the mean and then dividing by the standard deviation for each feature; you can use the [[http://​docs.scipy.org/​doc/​numpy/​reference/​routines.statistics.html | Numpy statistics module]] ​to perform the required calculations).
-with the soft margin constant $C$ set to some positive constant. Suppose the training data is linearly separable. +
-Since increasing the $\xi_i$ can only increase ​the objective ​of the primal problem ​(which +
-we are trying to minimize), at the optimal solution to the primal problem, all the +
-training examples will have $\xi_i$ equal +
-to zeroTrue or false? ​ Explain! +
-Given a linearly separable dataset, is it necessarily better to use a +
-a hard margin SVM over a soft-margin SVM?  Explain!+
  
-===== Part 3:  Using SVMs =====+==== Part ====
  
-The data for this question comes from database ​called ​SCOP (structural +Implement ridge regression in class called ​RidgeRegression that implements the classifier APIi.e''​fit''​ and ''​predict''​ methods with the same signature ​as the classifiers you implemented in the previous assignment.  ​Also implement functions ​for computing ​the following measures ​of error:
-classification of proteins)which classifies proteins into classes +
-according to their structure (download it from {{assignments:​scop_motif.data|here}})  +
-The data is a two-class classification +
-problem +
-of distinguishing a particular class of proteins from a selection of +
-examples sampled from the rest of the SCOP database +
-using features derived from their sequence (a protein is a chain of amino acids, so as computer scientists, we can consider it as a sequence over the alphabet of the 20 amino acids). +
-I chose to represent the proteins in terms of their motif composition.  ​A sequence motif is a +
-pattern of amino acids that is conserved in evolution. +
-Motifs are usually associated with regions of the protein that are +
-important ​for its function, and are therefore useful in differentiating between classes of proteins. +
-A given protein will typically contain only a handful of motifs, and +
-so the data is very sparse. +
-Therefore, only the non-zero elements ​of the data are represented. +
-Each line in the file describes a single example. ​ Here's an example from the file:+
  
-<​code>​ +  * The Root Mean Square Error (RMSE)
-d1scta_,​a.1.1.2 31417:1.0 32645:1.0 39208:1.0 42164:1.0 ...+  ​* ​The Maximum Absolute Deviation ​(MAD).
-</​code>​ +
-The first column is the ID of the protein, the second is the class it belongs to (the values for the class variable are ''​a.1.1.2'',​ which is the given class of proteins, and ''​rest''​ which is the negative class representing the rest of the database); the remainder consists of elements of the form ''​feature_id:​value''​which provide an id of a feature and the value associated with it. +
-This is an extension of the format used by LibSVM, that scikit-learn can read. +
-See a discussion of this format and how to read it [[http://​scikit-learn.org/​stable/​datasets/#​datasets-in-svmlight-libsvm-format | here]].+
  
-We note that the data is very high dimensional since 
-the number of conserved patterns in the space of all proteins is 
-large. 
-The data was constructed as part of the following analysis of detecting distant relationships between proteins: 
  
-  * A. Ben-Hur and D. Brutlag. [[http://​bioinformatics.oxfordjournals.org/​content/​19/​suppl_1/​i26.abstract | Remote homology detection: ​motif based approach]]. In: Proceedingseleventh international conference on intelligent systems for molecular biology. Bioinformatics 19(Suppl. 1)i26-i33, 2003.+For hypothesis $h$they are defined as follows:
  
-In this part of the assignment we will explore the dependence of classifier accuracy on  +$$RMSE(h) = \sqrt{\frac{1}{N}\sum_{i=1}^N (y_i h(\mathbf{x}_i))^2}$$
-the kernel, kernel parameters, kernel normalization,​ and the SVM soft-margin parameter. +
-In your implementation you can use the scikit-learn [[http://​scikit-learn.org/​stable/​modules/​generated/​sklearn.svm.SVC.html | svm]] class.+
  
-In this question we will consider both the Gaussian and polynomial kernels: 
-$$ 
-K_{gauss}(\mathbf{x},​ \mathbf{x'​}) = \exp(-\gamma || \mathbf{x} - \mathbf{x}'​ ||^2) 
-$$ 
 and and
-$$ 
-K_{poly}(\mathbf{x},​ \mathbf{x'​}) = (\mathbf{x}^T \mathbf{x}'​ + 1) ^{p}. 
-$$ 
  
-Plot the accuracy of the SVM, measured using the area under the ROC curve +$$MAD(h) = \frac{1}{N}\sum_{i=1}^N |y_i - h(\mathbf{x}_i)|.$$
-as a function of both the soft-margin parameter of the SVM, and the free parameter +
-of the kernel function. +
-Accuracy should be measured in five-fold cross-validation. +
-Show a couple of representative cross sections of this plot for a given value +
-of the soft margin parameter, and for a given value of the kernel parameter. +
-Comment on the results. ​ When exploring the values of a continuous +
-classifier/​kernel parameter it is +
-useful to use values that are distributed on an exponential grid, +
-i.e. something like 0.01, 0.1, 1, 10, 100 (note that the degree of the +
-polynomial kernel is not such a parameter).+
  
-Next, we will compare ​the accuracy of an SVM with a Gaussian kernel on the raw data with accuracy obtained when the data is normalized ​to be unit vectors (the values ​of the features ​of each example are divided by its norm)+With the code you just implemented,​ your next task is to explore ​the dependence ​of error on the value of the regularization parameter, $\lambda$
-This is different than standardization which operates at the level of individual features. ​ Normalizing to unit vectors is more appropriate for this dataset ​as it is sparsei.e. most of the features are zero. +In what follows set aside 30% of the data as a validation-setand compute ​the in-sample error, and the validation-set error as a function of the parameter $\lambda$ on the red wine data ​Choose the values of $\lambda$ on a logarithmic scale with values 0.010.1, 1, 10, 100, 1000 and plot the RMSE. 
-Perform your comparison by comparing the accuracy measured by the area under the ROC curve in five-fold cross validation. +Repeat the same experiment where instead of using all the training ​data, choose 20 random examples out of the training set, and train your model using those 20 examples, while evaluating on the same validation set.
-The optimal ​values of kernel parameters should be measured by cross-validationwhere the optimal SVM/kernel parameters are chosen ​using grid search on the training ​set of each fold. +
-Use the scikit-learn [[http://​scikit-learn.org/​stable/​tutorial/​statistical_inference/​model_selection.html +
- | grid-search]] class for model selection.+
  
 +Now answer the following:
  
-Finally, ​visualize ​the kernel matrix associated ​with the dataset. +  * What is the optimal value of $\lambda$?​ 
-Explain ​the structure that you are seeing ​in the plot (it is more +  * What observations can you make on the basis of these plots? ​ (The concepts of overfitting/​underfitting should be addressed in your answer). 
-interesting when the data is normalized).+  * Finally, ​compare ​the results that you are getting ​with the published results in the paper linked above In particular, is the performance ​you have obtained is comparable to that observed in the paper? 
 + 
 +==== Part 2 ==== 
 + 
 +Regression Error Characteristic (REC) curves ​are an interesting way of visualizing regression error as described 
 +in the following [[http://​machinelearning.wustl.edu/​mlpapers/​paper_files/​icml2003_BiB03.pdf|paper]]. 
 +Write a function that plots the REC curve of a regression method, and plot the REC curve of the best regressor you found in Part 1 of the assignment ​(i.e. the one that gave the lowest error on the validation set).  Plot the REC curve for both the validation set and the training set. 
 +What can you learn from this curve that you cannot learn from an error measure such as RMSE or MAD? 
 + 
 + 
 +==== Part 3 ==== 
 + 
 +As we discussed in class, the magnitude of the weight vector can be interpreted as a measure of feature importance. 
 +Train a ridge regression classifier on a subset of the dataset that you reserved for training. 
 +We will explore the relationship between the magnitude of weight vector components and their relevance to the classification task in several ways. 
 +Each feature ​is associated with a component of the weight vector. ​ It can also be associated with the correlation of that feature with the vector of labels. 
 +As we discussed in class, ​the magnitude of the weight vector can give an indication of feature relevance; another measure of relevance of a feature ​is its correlation with the labels. ​ To compare the two,  
 +create a scatter plot of weight vector components against the [[https://​en.wikipedia.org/​wiki/​Pearson_product-moment_correlation_coefficient | Pearson correlation coefficient]] of the corresponding feature with the labels (again, you can use the [[http://​docs.scipy.org/​doc/​numpy/​reference/​routines.statistics.html | Numpy statistics module]] to compute it)
 +What can you conclude from this plot? 
 +The paper ranks features according to their importance using a different approach. ​ Compare your results with what they obtain. 
 + 
 +Next, perform the following experiment:​ 
 +Incrementally remove the feature with the lowest absolute value of the weight vector and retrain the ridge regression classifier. 
 +Plot RMSE as a function of the number of features that remain on the test set which you have set aside and comment on the results.
  
 ===== Submission ===== ===== Submission =====
  
-Submit ​the pdf of your report via Canvas. ​ Python code can be displayed in your report if it is succinct (not more than a page or two at the most) or submitted separately The latex sample document shows how to display Python code in a latex document.  ​Code needs to be there so we can make sure that you implemented ​the algorithms and data analysis methodology correctly Canvas allows you to submit multiple files for an assignment, so DO NOT submit an archive file (tar, zip, etc).  ​Canvas will only allow you to submit pdfs (.pdf extension) or python ​code (.py extension). +Submit your report via Canvas. ​ Python code can be displayed in your report if it is short, and helps understand what you have done. The sample ​LaTex document ​provided in assignment 1 shows how to display Python code.  ​Submit the Python code that was used to generate the results as a file called ''​assignment3.py''​ (you can split the code into several ​.py files; ​Canvas allows you to submit multiple files).  ​Typing  
-For this assignment there is a strict 8 page limit (not including references and code that is provided as an appendix). ​ We will take off points for reports that go over the page limit. + 
-In addition to the code snippets that you include ​in your report, make sure you provide complete code from which we can see exactly how your results were generated.+<​code>​ 
 +python ​assignment3.py 
 +</code> 
 +should generate all the tables/​plots used in your report. ​  
 + 
 + 
  
 ===== Grading ===== ===== Grading =====
Line 103: Line 78:
 A few general guidelines for this and future assignments in the course: A few general guidelines for this and future assignments in the course:
  
-  * Always provide a description of the method you used to produce a given result in sufficient detail such that the reader can reproduce your results on the basis of the description ​(UNLESS the method has been provided in class or is there in the book).  ​Your code needs to be provided in sufficient detail so we can make sure that your implementation is correct. ​ The saying that "the devil is in the details"​ holds true for machine learning, and is sometimes the makes the difference between correct and incorrect results. ​ If your code is more than a few lines, you can include it as an appendix to your report, ​or submit it as a separate file Make sure your code is readable! +  ​* Your answers should be concise and to the point. ​  
-  * You can provide results in the form of tables, figures or text - whatever form is most appropriate for a given problem.+  * You need to use LaTex to write the report. 
 +  * The report is well structured, the writing is clear, with good grammar and correct spelling; good formatting of math, code, figures and captions (every figure and table needs to have a caption that explains what is being shown). 
 +  * Whenever you use information from the web or published papers, a reference should be provided. ​ Failure to do so is considered plagiarism. 
 + 
 +We will take off points if these guidelines are not followed. 
 + 
 +  ​* Always provide a description of the method you used to produce a given result in sufficient detail such that the reader can reproduce your results on the basis of the description.  ​You can use a few lines of python code or pseudo-code
 +  * You can provide results in the form of tables, figures or text - whatever form is most appropriate for a given problem.  There are no rules about how much space each answer should take.  BUT we will take off points if we have to wade through a lot of redundant data.
   * In any machine learning paper there is a discussion of the results. ​ There is a similar expectation from your assignments that you reason about your results. ​ For example, for the learning curve problem, what can you say on the basis of the observed learning curve?   * In any machine learning paper there is a discussion of the results. ​ There is a similar expectation from your assignments that you reason about your results. ​ For example, for the learning curve problem, what can you say on the basis of the observed learning curve?
-  * Write succinct answers. ​ We will take off points for rambling answers that are not to the point, and and similarly, if we have to wade through a lot of data/​results that are not to the point. 
  
 <​code>​ <​code>​
 Grading sheet for assignment 3 Grading sheet for assignment 3
  
-Part 1:  ​40 points. +Part 1:  ​50 points. 
-(10 points):  ​Primal SVM formulation ​is correct +(15 points):  ​Ridge regression ​is correctly ​implemented. 
-( 7 points): ​ Lagrangian found correctly +(15 points):  ​Plots of RMSE as a function ​of lambda are generated correctly. 
-points):  ​Derivation ​of saddle point equations +(20 points): ​ Discussion of the results
-(10 points): ​ Derivation ​of the dual +
-points): ​ Discussion of the implication of the form of the dual for SMO-like algorithms+
  
-Part 2:  ​10 points.+Part 2:  ​25 points. 
 +(15 points): ​ REC curves are generated correctly 
 +(10 points): ​ discussion of REC curves
  
-Part 3:  ​40 points. +Part 3:  ​25 points. 
-(20 points):  ​Accuracy as a function of parameters and discussion of the results +(20 points):  ​Weight vector analysis 
-(15 points): ​ Comparison ​of normalized and non-normalized kernels and correct model selection +points): ​ Comparison ​to the published results
-( 5 points): ​ Visualization of the kernel matrix and observations made about it+
  
-Report structure, grammar and spelling: ​ 10 points 
-(10 points): ​ Heading and subheading structure easy to follow and clearly divides report into logical sections.  ​ 
-              Code, math, figure captions, and all other aspects of the report are well-written and formatted. 
-              Grammar, spelling, and punctuation. ​ Answers are clear and to the point. 
 </​code>​ </​code>​
 +
 +
assignments/assignment3.1470759924.txt.gz · Last modified: 2016/09/15 12:53 (external edit)