Differences

This shows you the differences between two versions of the page.

--- assignments:assignment3 [2015/10/02 09:48]
asa
+++ assignments:assignment3 [2015/10/02 12:05]
asa
@@ Line 32: / Line 32: @@
 of distinguishing a particular class of proteins from a selection of
 examples sampled from the rest of the SCOP database
-using features derived from their sequence (note that a protein is an arbitrary length sequence over the alphabet of the 20 amino acids).
+using features derived from their sequence (a protein is a chain of amino acids, so as computer scientists, we can consider it as a sequence over the alphabet of the 20 amino acids).
-I chose to represent the proteins in
+I chose to represent the proteins in terms of their motif composition.  A sequence motif is a
-terms of their motif composition.  A sequence motif is a
 pattern of amino acids that is conserved in evolution.
 Motifs are usually associated with regions of the protein that are
 important for its function, and are therefore useful in differentiating between classes of proteins.
 A given protein will typically contain only a handful of motifs, and
-so the data is very sparse.  It is also very high dimensional, since
+so the data is very sparse.
+Therefore, only the non-zero elements of the data are represented.
+Each line in the file describes a single example.  Here's an example from the file:
+<code>
+d1scta_,a.1.1.2 31417:1.0 32645:1.0 39208:1.0 42164:1.0 ....
+</code>
+The first column is the ID of the protein and the second is the class it belongs to. The class variable takes two values: ''a.1.1.2'', which is the given class of proteins, and ''rest'', which is the negative class representing the rest of the database. The remaining elements are pairs of the form ''feature_id:value'': the id of a feature and the value associated with it.
+This is an extension of the format used by LibSVM, which scikit-learn can read.
+See the discussion [[http://scikit-learn.org/stable/datasets/#datasets-in-svmlight-libsvm-format | here]].
+We note that the data is very high dimensional since
 the number of conserved patterns in the space of all proteins is
 large.
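
The line format described in the added text (an ''id,class'' field followed by ''feature_id:value'' pairs) can be read with a few lines of Python. This is only a minimal sketch, not the course's own loader: the function name ''parse_example'' and the +1/-1 label encoding are assumptions made for illustration, and the sample line is the one quoted in the diff with its elided tail left out. Because the leading field is ''id,class'' rather than a plain numeric label, the sketch parses each line by hand.

```python
def parse_example(line):
    """Parse one line of the form 'id,class fid:val fid:val ...'.

    Returns (protein_id, label, features), where features maps each
    feature id to its value. Only non-zero features appear in the file,
    so the dict is a sparse representation of the example.
    """
    head, *pairs = line.split()
    protein_id, label = head.split(",", 1)
    features = {}
    for pair in pairs:
        fid, sep, val = pair.partition(":")
        if not sep:
            continue  # skip any token that is not a feature_id:value pair
        features[int(fid)] = float(val)
    return protein_id, label, features


# Sample line from the diff, without its elided tail.
sample = "d1scta_,a.1.1.2 31417:1.0 32645:1.0 39208:1.0 42164:1.0"
protein_id, label, features = parse_example(sample)

# Assumed encoding: the given class is positive, 'rest' is negative.
y = 1 if label == "a.1.1.2" else -1
```

Keeping each example as a dictionary of its non-zero features mirrors the sparsity noted above; the dictionaries could later be assembled into a sparse matrix if a learning library requires one.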

CS545 fall 2016
