Differences

This shows you the differences between two versions of the page.

--- assignments:assignment3 [2015/10/02 09:42]
asa
+++ assignments:assignment3 [2015/10/02 12:05]
asa
@@ Line 32: / Line 32: @@
 of distinguishing a particular class of proteins from a selection of
 examples sampled from the rest of the SCOP database
-using features derived from their sequence (note that a protein is an arbitrary length sequence over the alphabet of the 20 amino acids).
+using features derived from their sequence (a protein is a chain of amino acids, so as computer scientists, we can consider it as a sequence over the alphabet of the 20 amino acids).
-I chose to represent the proteins in
+I chose to represent the proteins in terms of their motif composition.  A sequence motif is a
-terms of their motif composition.  A sequence motif is a
 pattern of amino acids that is conserved in evolution.
 Motifs are usually associated with regions of the protein that are
 important for its function, and are therefore useful in differentiating between classes of proteins.
 A given protein will typically contain only a handful of motifs, and
-so the data is very sparse.  It is also very high dimensional, since
+so the data is very sparse.
+Therefore, only the non-zero elements of the data are represented.
+Each line in the file describes a single example.  Here's an example from the file:
+<code>
+d1scta_,a.1.1.2 31417:1.0 32645:1.0 39208:1.0 42164:1.0 ....
+</code>
+The first column is the ID of the protein, the second is the class it belongs to (the values for the class variable are ''a.1.1.2'', which is the given class of proteins, and ''rest'' which is the negative class representing the rest of the database), and the rest of the elements are pairs of the form ''feature_id:value'' - an id of a feature and the value associated with it.
+This is an extension of the format used by LibSVM, which scikit-learn can read.
+See a discussion [[http://scikit-learn.org/stable/datasets/#datasets-in-svmlight-libsvm-format | here]].
+We note that the data is very high dimensional since
 the number of conserved patterns in the space of all proteins is
 large.
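
As an illustration of the file format described in this hunk, here is a minimal sketch of how one line could be parsed into an ID, a class label, and a sparse feature dictionary. The function names `parse_example` and `to_binary_label` are hypothetical and not part of the assignment. The sample line is the one quoted above with its elided tail (`....`) dropped, and the +1/-1 label encoding is an assumption.

```python
def parse_example(line):
    """Parse one line of the form 'protein_id,class feature_id:value ...'.

    Returns (protein_id, label, features), where features maps the integer
    feature id to its float value. Only non-zero features appear in the file.
    """
    header, *pairs = line.split()
    # The first whitespace-separated token holds the ID and class, comma-separated.
    protein_id, label = header.split(",", 1)
    features = {}
    for pair in pairs:
        feature_id, value = pair.split(":")
        features[int(feature_id)] = float(value)
    return protein_id, label, features


def to_binary_label(label, positive_class="a.1.1.2"):
    """Map the class string to +1 (given class) or -1 ('rest'); an assumed encoding."""
    return 1 if label == positive_class else -1


# Sample line from the assignment text, truncated where the original is elided.
example = "d1scta_,a.1.1.2 31417:1.0 32645:1.0 39208:1.0 42164:1.0"
protein_id, label, features = parse_example(example)
print(protein_id, to_binary_label(label), sorted(features))
```

Keeping the features in a dictionary stores only the non-zero entries, which matches the sparse representation the text describes.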
@@ Line 84: / Line 94: @@
 ===== Submission =====
-Submit your report via C.  Python code can be displayed in your report if it is succinct (not more than a page or two at the most) or submitted separately.  The latex sample document shows how to display Python code in a latex document.
+Submit your report via Canvas.  Python code can be displayed in your report if it is succinct (not more than a page or two at the most) or submitted separately.  The latex sample document shows how to display Python code in a latex document.  Code needs to be there so we can make sure that you implemented the algorithms and data analysis methodology correctly.  Canvas allows you to submit multiple files for an assignment, so DO NOT submit an archive file (tar, zip, etc.).
-Also, please check-in a text file named README that describes what you found most difficult in completing this assignment (or provide that as a comment on ramct).
 ===== Grading =====
-Here is what the grade sheet will look like for this assignment.  A few general guidelines for this and future assignments in the course:
+A few general guidelines for this and future assignments in the course:
-  * Always provide a description of the method you used to produce a given result in sufficient detail such that the reader can reproduce your results on the basis of the description.  You can use a few lines of python code or pseudo-code.  If your code is more than a few lines, you can include it as an appendix to your report.  For example, for the first part of the assignment, provide the protocol you use to evaluate classifier accuracy.
+  * Always provide a description of the method you used to produce a given result in sufficient detail such that the reader can reproduce your results on the basis of the description (UNLESS the method has been provided in class or is in the book).  Your code needs to be provided in sufficient detail so we can make sure that your implementation is correct.  The saying that "the devil is in the details" holds true for machine learning, and is sometimes what makes the difference between correct and incorrect results.  If your code is more than a few lines, you can include it as an appendix to your report, or submit it as a separate file.  Make sure your code is readable!
-  * You can provide results in the form of tables, figures or text - whatever form is most appropriate for a given problem.  There are no rules about how much space each answer should take.  BUT we will take off points if we have to wade through a lot of redundant data.
+  * You can provide results in the form of tables, figures or text - whatever form is most appropriate for a given problem.
   * In any machine learning paper there is a discussion of the results.  There is a similar expectation from your assignments that you reason about your results.  For example, for the learning curve problem, what can you say on the basis of the observed learning curve?
+  * Write succinct answers.  We will take off points for rambling answers that are not to the point, and similarly, if we have to wade through a lot of data/results that are not to the point.
 <code>
 Grading sheet for assignment 2
-Part 1:  30 points.
+Part 1:  45 points.
+(10 points):  Primal SVM formulation is correct
 (10 points):  Lagrangian found correctly
-( 5 points):  Derivation of saddle point equations
+(10 points):  Derivation of saddle point equations
 (10 points):  Derivation of the dual
 ( 5 points):  Discussion of the implication of the form of the dual for SMO-like algorithms
@@ Line 106: / Line 117: @@
 Part 2:  15 points.
-Part 3:  15 points.
+Part 3:  40 points.
+(20 points):  Accuracy as a function of parameters and discussion of the results
-Part 1:  40 points.
+(15 points):  Comparison of normalized and non-normalized kernels and correct model selection
-(25 points):  Accuracy as a function of parameters and discussion of the results
-(10 points):  Comparison of normalized and non-normalized results
 ( 5 points):  Visualization of the kernel matrix and observations made about it
-Report structure, grammar and spelling:  15 points
+Report structure, grammar and spelling:  10 points
-( 5 points):  Heading and subheading structure easy to follow and
+(10 points):  Heading and subheading structure easy to follow and clearly divides report into logical sections.
-              clearly divides report into logical sections.
+              Code, math, figure captions, and all other aspects of the report are well-written and formatted.
-( 5 points):  Code, math, figure captions, and all other aspects of
+              Grammar, spelling, and punctuation.
-              report are well-written and formatted.
-( 5 points):  Grammar, spelling, and punctuation.
 </code>

CS545 fall 2016
