====== Assignment 3: Support Vector Machines ======

Due: October 20th at 6pm
===== Part 1: SVM with no bias term =====
===== Part 2: Closest Centroid Algorithm =====
Express the closest centroid algorithm in terms of kernels, i.e. determine how the coefficients $\alpha_i$ will be computed using a given labeled dataset.
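As a reminder of the algorithm in its primal (non-kernel) form, here is a minimal plain-Python sketch of the closest centroid classifier; expressing this computation purely in terms of kernel evaluations is the exercise:
<code python>
def closest_centroid_predict(X_pos, X_neg, x):
    """Classify x as +1 or -1 according to which class centroid
    it is closer to in Euclidean distance."""
    d = len(x)
    # Centroid (mean vector) of each class.
    mu_pos = [sum(row[j] for row in X_pos) / len(X_pos) for j in range(d)]
    mu_neg = [sum(row[j] for row in X_neg) / len(X_neg) for j in range(d)]
    # Squared distances from x to each centroid.
    dist2_pos = sum((x[j] - mu_pos[j]) ** 2 for j in range(d))
    dist2_neg = sum((x[j] - mu_neg[j]) ** 2 for j in range(d))
    return 1 if dist2_pos < dist2_neg else -1

# The point [3, 1] is much closer to the centroid of the positive class:
print(closest_centroid_predict([[2.0, 2.0], [4.0, 2.0]],
                               [[-2.0, 0.0], [-4.0, 0.0]],
                               [3.0, 1.0]))  # 1
</code>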
===== Part 3: Soft-margin SVM for separable data =====

Consider training a soft-margin SVM with $C$ set to some positive constant. Suppose the training data is linearly separable.
Since increasing the $\xi_i$ can only increase the objective of the primal problem (which we are trying to minimize), at the optimal solution to the primal problem all the training examples will have $\xi_i$ equal to zero. True or false? Explain!
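For reference, the soft-margin primal problem referred to above is the standard formulation (stated here for convenience; check that it matches the one from class):
$$
\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2} || \mathbf{w} ||^2 + C \sum_i \xi_i
$$
subject to $y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$ for all $i$.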
Given a linearly separable dataset, is it necessarily better to use a hard-margin SVM than a soft-margin SVM?

===== Part 4: Using SVMs =====
The data for this question comes from a database called SCOP (structural classification of proteins), which classifies proteins into classes according to their structure (download it from {{assignments:scop_motif.data|here}}).
The data is a two-class classification problem
  * A. Ben-Hur and D. Brutlag. [[http://bioinformatics.oxfordjournals.org/content/19/suppl_1/i26.abstract | Remote homology detection: a motif based approach]]. In: Proceedings, eleventh international conference on intelligent systems for molecular biology. Bioinformatics 19(Suppl. 1): i26-i33, 2003.
In this part of the assignment we will explore the dependence of classifier accuracy on the kernel, kernel parameters, kernel normalization, and the soft-margin parameter of the SVM.
The use of the SVM class is discussed in the PyML [[http://pyml.sourceforge.net/tutorial.html#svms|tutorial]], and by using help(SVM) in the Python interpreter.
By default, a dataset is instantiated with a linear kernel attached to it.
To use a different kernel you need to attach a new kernel to the dataset:
<code python>
>>> data.attachKernel(ker.Polynomial(degree = 3))
</code>
Alternatively, you can instantiate an SVM with a given kernel:
<code python>
>>> classifier = SVM(ker.Gaussian(gamma = 0.1))
</code>
This will override the kernel the data is associated with.

In this question we will consider both the Gaussian and polynomial kernels:
$$
K_{gauss}(\mathbf{x}, \mathbf{x'}) = \exp(-\gamma || \mathbf{x} - \mathbf{x}' ||^2)
$$
$$
K_{poly}(\mathbf{x}, \mathbf{x'}) = (1 + \mathbf{x}^T \mathbf{x}') ^{p}
$$
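As a concrete illustration of these two formulas (plain Python, independent of PyML):
<code python>
import math

def gaussian_kernel(x, xp, gamma):
    """K_gauss(x, x') = exp(-gamma * ||x - x'||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-gamma * sq_dist)

def polynomial_kernel(x, xp, degree):
    """K_poly(x, x') = (1 + x^T x')^p."""
    dot = sum(a * b for a, b in zip(x, xp))
    return (1.0 + dot) ** degree

# The Gaussian kernel attains its maximum value of 1 when both
# arguments coincide, and decays toward 0 as they move apart.
print(gaussian_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.1))  # 1.0
print(polynomial_kernel([1.0, 0.0], [0.0, 1.0], degree=3))  # 1.0 (orthogonal inputs)
</code>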
Plot the accuracy of the SVM, measured using the balanced success rate, as a function of both the soft-margin parameter of the SVM and the free parameter of the kernel function.
Show a couple of representative cross sections of this plot for a given value of the soft-margin parameter, and for a given value of the kernel parameter.
Comment on the results. When exploring the values of a continuous classifier/kernel parameter it is useful to use values that are distributed on an exponential grid, i.e. something like 0.01, 0.1, 1, 10, 100 (note that the degree of the polynomial kernel is not such a parameter).
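Such a grid is easy to generate in Python (a small sketch; the exact range and base are up to you):
<code python>
def exponential_grid(low_exp, high_exp, base=10.0):
    """Values base**k for k = low_exp, ..., high_exp,
    e.g. 0.01, 0.1, 1, 10, 100 for base 10 and exponents -2..2."""
    return [base ** k for k in range(low_exp, high_exp + 1)]

grid = exponential_grid(-2, 2)   # five values spanning 0.01 to 100
for value in grid:
    print(value)
</code>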
For this type of sparse dataset it is useful to normalize the input features.
One way to do so is to divide each input example by its norm. This is accomplished in PyML by:
Explain the structure that you are seeing in the plot (it is more interesting when the data is normalized).
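To see what this normalization does to the kernel, here is a plain-Python sketch (independent of PyML): dividing each example by its Euclidean norm makes every linear-kernel self-similarity equal to 1, so the entries of the kernel matrix become cosine similarities in $[-1, 1]$.
<code python>
import math

def normalize(x):
    """Divide a vector by its Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

def linear_kernel(x, xp):
    return sum(a * b for a, b in zip(x, xp))

data = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
normalized = [normalize(x) for x in data]

# After normalization the diagonal of the kernel matrix is all (close to) 1,
# and every off-diagonal entry is the cosine of the angle between two examples.
K = [[linear_kernel(x, xp) for xp in normalized] for x in normalized]
for row in K:
    print(row)
</code>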

===== Submission =====

Submit your report via RamCT. Python code can be displayed in your report if it is succinct (no more than a page or two at the most) or submitted separately. The latex sample document shows how to display Python code in a latex document.
Also, please check in a text file named README that describes what you found most difficult in completing this assignment (or provide that as a comment on RamCT).

===== Grading =====

Here is what the grade sheet will look like for this assignment. A few general guidelines for this and future assignments in the course:

  * Always provide a description of the method you used to produce a given result, in sufficient detail that the reader can reproduce your results on the basis of the description. You can use a few lines of python code or pseudo-code. If your code is more than a few lines, you can include it as an appendix to your report. For example, for the first part of the assignment, provide the protocol you use to evaluate classifier accuracy.
  * You can provide results in the form of tables, figures, or text - whatever form is most appropriate for a given problem. There are no rules about how much space each answer should take, but we will take off points if we have to wade through a lot of redundant data.
  * In any machine learning paper there is a discussion of the results. There is a similar expectation for your assignments: reason about your results. For example, for the learning curve problem, what can you say on the basis of the observed learning curve?

<code>
Grading sheet for assignment 3

Part 1: 30 points.
(10 points): Lagrangian found correctly
( 5 points): Derivation of saddle point equations
(10 points): Derivation of the dual
( 5 points): Discussion of the implication of the form of the dual for SMO-like algorithms

Part 2: 15 points.

Part 3: 15 points.

Part 4: 40 points.
(25 points): Accuracy as a function of parameters and discussion of the results
(10 points): Comparison of normalized and non-normalized results
( 5 points): Visualization of the kernel matrix and observations made about it

Report structure, grammar and spelling: 15 points
( 5 points): Heading and subheading structure easy to follow and
             clearly divides report into logical sections.
( 5 points): Code, math, figure captions, and all other aspects of
             report are well-written and formatted.
( 5 points): Grammar, spelling, and punctuation.
</code>