assignments:assignment3 [CS545 fall 2016]


Differences

This shows you the differences between two versions of the page.

assignments:assignment3 [2015/10/02 09:48]
asa
assignments:assignment3 [2015/10/02 12:42]
asa
Line 1: Line 1:
 ========= Assignment 3: Support Vector Machines ============ ========= Assignment 3: Support Vector Machines ============
  
-Due:  October 20th at 6pm
+Due:  October 16th at 11pm
  
 ===== Part 1:  SVM with no bias term ===== ===== Part 1:  SVM with no bias term =====
Line 7: Line 7:
 Formulate a soft-margin SVM without the bias term, i.e. one where the discriminant function is equal to $\mathbf{w}^{T} \mathbf{x}$. Formulate a soft-margin SVM without the bias term, i.e. one where the discriminant function is equal to $\mathbf{w}^{T} \mathbf{x}$.
 Derive the saddle point conditions, KKT conditions and the dual. Derive the saddle point conditions, KKT conditions and the dual.
-Compare it to the standard SVM formulation.
-As we discussed in class, SMO-type algorithms for the dual optimize the smallest number of variables at a time, which is two variables.
-Is this still the case for the formulation you have derived?
+Compare it to the standard SVM formulation that was derived in class.
+In class we discussed SMO-type algorithms for optimizing the dual SVM.  At each step SMO optimizes two variables at a time, which is the smallest number possible.
+Is this still the case for the formulation you have derived?  In other words, is two the smallest number of variables that can be optimized at a time?
 Hint:  consider the difference in the constraints. Hint:  consider the difference in the constraints.
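For reference when making the comparison, the standard soft-margin SVM with a bias term is usually written as below. This is a sketch in common textbook notation; the symbols $C$, $\xi_i$, $\alpha_i$, $y_i$ and $b$ are assumptions that do not appear on this page, and the derivation given in class may use different notation.

$$
\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2} ||\mathbf{w}||^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad y_i (\mathbf{w}^{T} \mathbf{x}_i + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0
$$

$$
\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^{T} \mathbf{x}_j
\quad \text{subject to} \quad 0 \leq \alpha_i \leq C, \;\; \sum_{i=1}^{n} \alpha_i y_i = 0
$$

In this standard form, the equality constraint in the dual comes from setting the derivative of the Lagrangian with respect to $b$ to zero.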
  
Line 32: Line 32:
 of distinguishing a particular class of proteins from a selection of of distinguishing a particular class of proteins from a selection of
 examples sampled from the rest of the SCOP database examples sampled from the rest of the SCOP database
-using features derived from their sequence (note that a protein is an arbitrary length sequence over the alphabet of the 20 amino acids).
-I chose to represent the proteins in
-terms of their motif composition.  A sequence motif is a
+using features derived from their sequence (a protein is a chain of amino acids, so as computer scientists, we can consider it as a sequence over the alphabet of the 20 amino acids).
+I chose to represent the proteins in terms of their motif composition.  A sequence motif is a
 pattern of amino acids that is conserved in evolution. pattern of amino acids that is conserved in evolution.
 Motifs are usually associated with regions of the protein that are Motifs are usually associated with regions of the protein that are
 important for its function, and are therefore useful in differentiating between classes of proteins. important for its function, and are therefore useful in differentiating between classes of proteins.
 A given protein will typically contain only a handful of motifs, and A given protein will typically contain only a handful of motifs, and
-so the data is very sparse.  It is also very high dimensional since
+so the data is very sparse.
 +Therefore, only the non-zero elements of the data are represented. 
 +Each line in the file describes a single example.  ​Here's an example from the file: 
 + 
 +<code>
 +d1scta_,a.1.1.2 31417:1.0 32645:1.0 39208:1.0 42164:1.0 ....
 +</code>
 +The first column is the ID of the protein, the second is the class it belongs to (the values for the class variable are ''a.1.1.2'', which is the given class of proteins, and ''rest'', which is the negative class representing the rest of the database); the remainder consists of elements of the form ''feature_id:value'', which provide the id of a feature and the value associated with it.
 +This is an extension of the format used by LibSVM, which scikit-learn can read.
 +See a discussion of this format and how to read it [[http://scikit-learn.org/stable/datasets/#datasets-in-svmlight-libsvm-format | here]].
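The file format described above could be read with a few lines of plain Python. The following is a minimal sketch, not the course's reference loader: the function names `parse_line` and `load_examples`, the mapping of ''a.1.1.2'' to +1 and ''rest'' to -1, and the second example line are all assumptions made for illustration.

```python
def parse_line(line):
    """Split one line into (protein_id, class_label, {feature_id: value})."""
    tokens = line.split()
    # The first token holds the protein ID and the class, separated by a comma.
    protein_id, label = tokens[0].split(",", 1)
    features = {}
    for token in tokens[1:]:
        feature_id, value = token.split(":")
        features[int(feature_id)] = float(value)
    return protein_id, label, features


def load_examples(lines, positive_class="a.1.1.2"):
    """Parse many lines; labels become +1 (positive class) or -1 (assumed encoding)."""
    ids, labels, examples = [], [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        protein_id, label, features = parse_line(line)
        ids.append(protein_id)
        labels.append(1 if label == positive_class else -1)
        examples.append(features)
    return ids, labels, examples


if __name__ == "__main__":
    sample = [
        # First four features of the example line shown above.
        "d1scta_,a.1.1.2 31417:1.0 32645:1.0 39208:1.0 42164:1.0",
        # Hypothetical negative example, invented for this sketch.
        "dhypo__,rest 12:1.0 32645:1.0",
    ]
    ids, labels, examples = load_examples(sample)
    print(ids, labels, examples[0])
```

Keeping each example as a dictionary of its non-zero features mirrors the sparse representation used in the file; the dictionaries can be converted to a sparse matrix afterwards if a library requires one.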
 + 
 +We note that the data is very high dimensional since
 the number of conserved patterns in the space of all proteins is the number of conserved patterns in the space of all proteins is
 large. large.
Line 54: Line 64:
 K_{gauss}(\mathbf{x}, \mathbf{x}') = \exp(-\gamma || \mathbf{x} - \mathbf{x}' ||^2) K_{gauss}(\mathbf{x}, \mathbf{x}') = \exp(-\gamma || \mathbf{x} - \mathbf{x}' ||^2)
 $$ $$
 +and
 $$ $$
-K_{poly}(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}' + 1)^{p}
+K_{poly}(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}' + 1)^{p}.
 $$ $$
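The two kernels above can be evaluated directly on sparse examples. The following is a minimal sketch that assumes each example is stored as a dictionary mapping feature ids to non-zero values; that representation and the function names `dot`, `gaussian_kernel` and `polynomial_kernel` are illustrative choices, not part of the assignment.

```python
import math


def dot(x, z):
    """Dot product of two sparse vectors stored as {feature_id: value} dicts."""
    if len(x) > len(z):
        x, z = z, x  # iterate over the shorter vector
    return sum(value * z[key] for key, value in x.items() if key in z)


def gaussian_kernel(x, z, gamma):
    """exp(-gamma * ||x - z||^2), using ||x - z||^2 = x.x - 2 x.z + z.z."""
    squared_distance = dot(x, x) - 2.0 * dot(x, z) + dot(z, z)
    return math.exp(-gamma * squared_distance)


def polynomial_kernel(x, z, p):
    """(x.z + 1)^p for a polynomial kernel of degree p."""
    return (dot(x, z) + 1.0) ** p


if __name__ == "__main__":
    # Two small made-up examples that share exactly one feature.
    x = {1: 1.0, 2: 1.0}
    z = {2: 1.0, 3: 1.0}
    print(gaussian_kernel(x, z, gamma=0.5))
    print(polynomial_kernel(x, z, p=2))
```

Expanding the squared distance into dot products means only features present in both examples need to be matched, which suits data where each protein contains just a handful of motifs.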
  
Line 84: Line 95:
 ===== Submission ===== ===== Submission =====
  
-Submit your report via Canvas.  Python code can be displayed in your report if it is succinct (not more than a page or two at the most) or submitted separately.  The latex sample document shows how to display Python code in a latex document.  Code needs to be there so we can make sure that you implemented the algorithms and data analysis methodology correctly.  Canvas allows you to submit multiple files for an assignment, so DO NOT submit an archive file (tar, zip, etc).
+Submit the pdf of your report via Canvas.  Python code can be displayed in your report if it is succinct (not more than a page or two at the most) or submitted separately.  The latex sample document shows how to display Python code in a latex document.  Code needs to be there so we can make sure that you implemented the algorithms and data analysis methodology correctly.  Canvas allows you to submit multiple files for an assignment, so DO NOT submit an archive file (tar, zip, etc).  Canvas will only allow you to submit pdfs (.pdf extension) or python code (.py extension).
  
 ===== Grading ===== ===== Grading =====
Line 98: Line 109:
 Grading sheet for assignment 2 Grading sheet for assignment 2
  
-Part 1:  45 points.
+Part 1:  40 points.
 (10 points): ​ Primal SVM formulation is correct (10 points): ​ Primal SVM formulation is correct
 (10 points): ​ Lagrangian found correctly (10 points): ​ Lagrangian found correctly
Line 105: Line 116:
 ( 5 points): ​ Discussion of the implication of the form of the dual for SMO-like algorithms ( 5 points): ​ Discussion of the implication of the form of the dual for SMO-like algorithms
  
-Part 2:  15 points.
+Part 2:  10 points.
  
 Part 3:  40 points. Part 3:  40 points.
assignments/assignment3.txt · Last modified: 2016/09/20 09:34 by asa