===== Part 1: SVM with no bias term =====

Formulate a soft-margin SVM without the bias term, i.e. $f(\mathbf{x}) = \mathbf{w}^{T} \mathbf{x}$.
Derive the saddle point conditions, KKT conditions and the dual.
Compare it to the standard SVM formulation.

Hint: consider the difference in the constraints.
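
For comparison, recall the standard soft-margin SVM primal (with bias term $b$), against which the bias-free formulation should be set:
$$
\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i (\mathbf{w}^{T} \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 .
$$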

===== Part 2: Closest Centroid Algorithm =====

Express the closest centroid algorithm in terms of kernels, i.e. show how the coefficients $\alpha_i$ are determined from a given labeled dataset.
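
As a starting point, here is a minimal sketch of the closest centroid classifier in its direct (non-kernel) form; the function name and the NumPy representation are illustrative, not part of the assignment code. The exercise is to rewrite the decision rule so that it touches the data only through inner products.
<code python>
import numpy as np

def closest_centroid_predict(X_train, y_train, x):
    """Predict the label of x as that of the nearer class centroid.

    X_train : (n, d) array of training examples
    y_train : (n,) array of labels in {-1, +1}
    """
    mu_pos = X_train[y_train == +1].mean(axis=0)  # positive-class centroid
    mu_neg = X_train[y_train == -1].mean(axis=0)  # negative-class centroid
    # Assign x to the class whose centroid is closer in Euclidean distance;
    # expanding these squared distances is what exposes the inner products.
    if np.linalg.norm(x - mu_pos) <= np.linalg.norm(x - mu_neg):
        return 1
    return -1
</code>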

===== Part 3: Using SVMs =====

The data for this question comes from a database called SCOP (structural classification of proteins), which classifies proteins into classes according to their structure (download it from {{assignments:scop_motif.data|here}}).
The data is a two-class classification problem of distinguishing a particular class of proteins from a selection of examples sampled from the rest of the SCOP database.
I chose to represent the proteins in terms of their motif composition. A sequence motif is a pattern of nucleotides/amino acids that is conserved in evolution.
Motifs are usually associated with regions of the protein that are important for its function, and are therefore useful in differentiating between classes of proteins.
A given protein will typically contain only a handful of motifs, and so the data is very sparse. It is also very high dimensional, since the number of conserved patterns in the space of all proteins is large.
The data was constructed as part of the following analysis of detecting distant relationships between proteins:

  * A. Ben-Hur and D. Brutlag. [[http://bioinformatics.oxfordjournals.org/content/19/suppl_1/i26.abstract|Remote homology detection: a motif based approach]]. In: Proceedings of the Eleventh International Conference on Intelligent Systems for Molecular Biology. Bioinformatics 19(Suppl. 1): i26-i33, 2003.
+ | |||
+ | Download the dataset associated with this assignment from the homework | ||
+ | page of the course. | ||
+ | In this assignment we will explore the dependence of classifier accuracy on | ||
+ | the kernel, kernel parameters, kernel normalization, and SVM parameter. | ||
+ | The use of the SVM class is discussed in the PyML [[http://pyml.sourceforge.net/tutorial.html#svms|tutorial]]. | ||
+ | |||
+ | By default a dataset is instantiated with a linear kernel attached to it. | ||
+ | To use a different kernel you need to attach a new kernel to the dataset: | ||
<code python>
>>> from PyML import ker
>>> data.attachKernel(ker.Gaussian(gamma = 0.1))
</code>
or
<code python>
>>> from PyML import ker
>>> data.attachKernel(ker.Polynomial(degree = 3))
</code>
In this question we will consider both the Gaussian and polynomial kernels:
$$
K_{gaus}(\mathbf{x}, \mathbf{x}') = \exp(-\gamma \| \mathbf{x} - \mathbf{x}' \|^2)
$$
$$
K_{poly}(\mathbf{x}, \mathbf{x}') = (1 + \mathbf{x}^{T} \mathbf{x}')^{p}
$$
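
To make the formulas concrete, here is a small NumPy sketch of both kernel functions (illustrative only; PyML computes these internally once a kernel is attached):
<code python>
import numpy as np

def gaussian_kernel(x, xp, gamma):
    # K(x, x') = exp(-gamma * ||x - x'||^2)
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def polynomial_kernel(x, xp, degree):
    # K(x, x') = (1 + x^T x')^p
    return (1.0 + np.dot(x, xp)) ** degree
</code>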
Plot the accuracy of the classifier, measured using the success rate and the area under the ROC curve, as a function of both the ridge parameter of the classifier and the free parameter of the kernel function.
Show a couple of representative cross sections of this plot: one for a given value of the ridge parameter, and one for a given value of the kernel parameter.
Comment on the results. When exploring the values of a continuous classifier/kernel parameter it is useful to use values that are distributed on an exponential grid, i.e. something like 0.01, 0.1, 1, 10, 100 (note that the degree of the polynomial kernel is not such a parameter).
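
A possible skeleton for such a scan, assuming PyML's SVM class and the cv method shown in the tutorial; the result-object accessors used below (getSuccessRate, getROC) are assumptions to check against your PyML version:
<code python>
from PyML import ker
from PyML import SVM  # assumption: SVM is importable from the top-level package

gammas = [0.01, 0.1, 1, 10, 100]  # exponential grid for the Gaussian kernel width
Cs = [0.01, 0.1, 1, 10, 100]      # exponential grid for the classifier's ridge parameter

scores = {}
for gamma in gammas:
    data.attachKernel(ker.Gaussian(gamma = gamma))
    for C in Cs:
        r = SVM(C = C).cv(data, 5)  # 5-fold cross-validation
        # Record success rate and area under the ROC curve for this grid point.
        scores[(gamma, C)] = (r.getSuccessRate(), r.getROC())
</code>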
+ | |||
+ | |||
+ | For this type of sparse dataset it is useful to normalize the input features before | ||
+ | training and testing your classifier. | ||
+ | One way to do so is to divide each input example by its norm. This is | ||
+ | accomplished in PyML by: | ||
+ | <code python> | ||
+ | >>> data.normalize() | ||
+ | </code> | ||
+ | Compare the results under this normalization with what you obtain | ||
+ | without normalization. | ||
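
As a side note on what this normalization does: dividing each example by its Euclidean norm places it on the unit sphere, so the linear kernel between normalized examples equals the cosine similarity of the original ones. A minimal NumPy illustration (assuming no all-zero rows):
<code python>
import numpy as np

def normalize_rows(X):
    # Divide each example (row) by its Euclidean norm; assumes no zero rows.
    return X / np.linalg.norm(X, axis=1, keepdims=True)

# For normalized rows xn, yn: np.dot(xn, yn) is the cosine of the angle
# between the original examples, since ||xn|| = ||yn|| = 1.
</code>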
+ | |||
+ | You can visualize the whole kernel matrix associated with the data using the following commands: | ||
+ | <code python> | ||
+ | >>> from PyML import ker | ||
+ | >>> ker.showKernel(data) | ||
+ | </code> | ||
Explain the structure that you are seeing in the plot (it is more interesting when the data is normalized).
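
If you want to reproduce such a plot outside PyML, a hand-rolled sketch (assuming a dense NumPy feature matrix X whose rows are sorted by class label, so that class structure shows up as blocks):
<code python>
import numpy as np
import matplotlib.pyplot as plt

def show_linear_kernel(X):
    # Gram matrix of the linear kernel over all pairs of examples.
    K = np.dot(X, X.T)
    plt.imshow(K, interpolation='nearest')
    plt.colorbar()
    plt.title('Kernel matrix')
    plt.show()
</code>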