====== Assignment 3: Support Vector Machines ======

Due:  October 20th at 6pm
  
===== Part 1:  SVM with no bias term =====

Hint:  consider the difference in the constraints.
  
===== Part 2:  Closest Centroid Algorithm =====
  
Express the closest centroid algorithm in terms of kernels, i.e. determine how the coefficients $\alpha_i$ will be computed using a given labeled dataset.
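
For reference, here is a minimal numpy sketch of the plain (non-kernelized) closest centroid classifier; the function and variable names are illustrative, and rewriting the decision rule in terms of kernel evaluations with coefficients $\alpha_i$ is the exercise:
<code python>
import numpy as np

def closest_centroid_predict(X_train, y_train, x):
    # centroid (mean vector) of each class in the training set
    centroids = {label: X_train[y_train == label].mean(axis=0)
                 for label in np.unique(y_train)}
    # predict the label whose centroid is closest to x
    return min(centroids, key=lambda label: np.linalg.norm(x - centroids[label]))
</code>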
  
===== Part 3:  Soft-margin SVM for separable data =====
  
Consider training a soft-margin SVM
with $C$ set to some positive constant. Suppose the training data is linearly separable.
Since increasing the $\xi_i$ can only increase the objective of the primal problem (which
we are trying to minimize), at the optimal solution to the primal problem, all the
training examples will have $\xi_i$ equal
to zero. True or false?  Explain!
Given a linearly separable dataset, is it necessarily better to use
a hard-margin SVM over a soft-margin SVM?
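
For reference, a standard formulation of the soft-margin primal problem mentioned above (the usual textbook version, with a bias term $b$):
$$
\min_{\mathbf{w}, b, \xi} \frac{1}{2} || \mathbf{w} ||^2 + C \sum_i \xi_i \quad \textrm{subject to} \quad y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0.
$$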
  
===== Part 4:  Using SVMs =====

The data for this question comes from a database called SCOP (structural
classification of proteins), which classifies proteins into classes
according to their structure (download it from {{assignments:scop_motif.data|here}}).
The data is a two-class classification problem
of distinguishing a particular class of proteins from a selection of
examples sampled from the rest of the SCOP database.
I chose to represent the proteins in
terms of their motif composition.  A sequence motif is a
pattern of nucleotides/amino acids that is conserved in evolution.
Motifs are usually associated with regions of the protein that are
important for its function, and are therefore useful in differentiating between classes of proteins.
A given protein will typically contain only a handful of motifs, and
so the data is very sparse.  It is also very high dimensional, since
the number of conserved patterns in the space of all proteins is
large.
The data was constructed as part of the following study of detecting distant relationships between proteins:

  * A. Ben-Hur and D. Brutlag. [[http://bioinformatics.oxfordjournals.org/content/19/suppl_1/i26.abstract|Remote homology detection: a motif based approach]]. Proceedings of the eleventh international conference on intelligent systems for molecular biology. Bioinformatics 19(Suppl. 1): i26-i33, 2003.

In this part of the assignment we will explore the dependence of classifier accuracy on
the kernel, kernel parameters, kernel normalization, and the SVM soft-margin parameter.
The use of the SVM class is discussed in the PyML [[http://pyml.sourceforge.net/tutorial.html#svms|tutorial]], and via help(SVM) in the Python interpreter.

By default, a dataset is instantiated with a linear kernel attached to it.
To use a different kernel you need to attach a new kernel to the dataset:
<code python>
>>> data.attachKernel(ker.Polynomial(degree = 3))
</code>
Alternatively, you can instantiate an SVM with a given kernel:
<code python>
>>> classifier = SVM(ker.Gaussian(gamma = 0.1))
</code>
This will override the kernel the data is associated with.
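
To evaluate such a classifier, the PyML tutorial uses cross-validation; a minimal sketch, assuming you have downloaded the data file as scop_motif.data and using the cv and result-accessor methods shown in the tutorial:
<code python>
>>> from PyML import *
>>> data = SparseDataSet('scop_motif.data')
>>> classifier = SVM(ker.Gaussian(gamma = 0.1), C = 10)
>>> results = classifier.cv(data, 5)   # 5-fold cross-validation
>>> results.getBalancedSuccessRate()
</code>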
In this question we will consider both the Gaussian and polynomial kernels:
$$
K_{gauss}(\mathbf{x}, \mathbf{x'}) = \exp(-\gamma || \mathbf{x} - \mathbf{x}' ||^2)
$$
$$
K_{poly}(\mathbf{x}, \mathbf{x'}) = (1 + \mathbf{x}^T \mathbf{x}')^{p}
$$
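
As a sanity check on these definitions, both kernels are easy to compute directly; a numpy sketch (PyML's ker.Gaussian and ker.Polynomial are assumed to implement the same formulas):
<code python>
import numpy as np

def gaussian_kernel(x, xp, gamma):
    # K(x, x') = exp(-gamma * ||x - x'||^2)
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def polynomial_kernel(x, xp, p):
    # K(x, x') = (1 + x^T x')^p
    return (1 + np.dot(x, xp)) ** p
</code>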
Plot the accuracy of the SVM, measured using the balanced success rate,
as a function of both the soft-margin parameter of the SVM and the free parameter
of the kernel function.
Show a couple of representative cross sections of this plot for a given value
of one of the parameters.
Comment on the results.  When exploring the values of a continuous
classifier/kernel parameter it is
useful to use values that are distributed on an exponential grid,
i.e. something like 0.01, 0.1, 1, 10, 100 (note that the degree of the
polynomial kernel is not such a parameter).
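
A minimal sketch of such a sweep over an exponential grid, assuming the same PyML cross-validation interface and data file as above (the grids are illustrative):
<code python>
>>> from PyML import *
>>> data = SparseDataSet('scop_motif.data')
>>> for C in [0.01, 0.1, 1, 10, 100]:
...     for gamma in [0.01, 0.1, 1, 10, 100]:
...         results = SVM(ker.Gaussian(gamma = gamma), C = C).cv(data, 5)
...         print C, gamma, results.getBalancedSuccessRate()
</code>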
  
For this type of sparse dataset it is useful to normalize the input features.
One way to do so is to divide each input example by its norm.  This is
accomplished in PyML by:
<code python>
>>> data.normalize()
</code>
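
For intuition, this is the row-wise scaling being performed; a numpy sketch (data.normalize() is assumed to divide each example by its L2 norm, and the array is illustrative):
<code python>
import numpy as np

X = np.array([[3.0, 4.0],
              [1.0, 0.0]])
# divide each example (row) by its norm so that every row has unit length
X_normalized = X / np.linalg.norm(X, axis=1, keepdims=True)
</code>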
Compare the results under this normalization with what you obtain without it.
Also visualize the kernel matrix and comment on what you observe (this is particularly
interesting when the data is normalized).
  
===== Submission =====

Submit your report via RamCT.  Python code can be displayed in your report if it is succinct (no more than a page or two at most) or submitted separately.  The LaTeX sample document shows how to display Python code in a LaTeX document.
Also, please check in a text file named README that describes what you found most difficult in completing this assignment (or provide it as a comment on RamCT).

===== Grading =====

Here is what the grade sheet will look like for this assignment.  A few general guidelines for this and future assignments in the course:

  * Always provide a description of the method you used to produce a given result, in sufficient detail that the reader can reproduce your results on the basis of the description.  You can use a few lines of Python code or pseudo-code.  If your code is more than a few lines, you can include it as an appendix to your report.  For example, for the first part of the assignment, provide the protocol you use to evaluate classifier accuracy.
  * You can provide results in the form of tables, figures, or text - whatever form is most appropriate for a given problem.  There are no rules about how much space each answer should take, BUT we will take off points if we have to wade through a lot of redundant data.
  * In any machine learning paper there is a discussion of the results.  There is a similar expectation for your assignments: reason about your results.  For example, for the learning curve problem, what can you say on the basis of the observed learning curve?

<code>
Grading sheet for assignment 3

Part 1:  30 points.
(10 points):  Lagrangian found correctly
( 5 points):  Derivation of saddle point equations
(10 points):  Derivation of the dual
( 5 points):  Discussion of the implication of the form of the dual for SMO-like algorithms

Part 2:  15 points.

Part 3:  15 points.

Part 4:  40 points.
(25 points):  Accuracy as a function of parameters and discussion of the results
(10 points):  Comparison of normalized and non-normalized results
( 5 points):  Visualization of the kernel matrix and observations made about it

Report structure, grammar and spelling:  15 points
( 5 points):  Heading and subheading structure easy to follow and
              clearly divides report into logical sections.
( 5 points):  Code, math, figure captions, and all other aspects of
              report are well-written and formatted.
( 5 points):  Grammar, spelling, and punctuation.
</code>