===== Hands on work in PyML and evaluating classifier performance =====
Let's start using our perceptron on some data and see how it's doing.
=== Reading in the data ===
First we need to import PyML:
In [1]: from PyML import *
Next let's load the gisette dataset that you are going to use in assignment 1:
In [2]: data = vectorDatasets.PyVectorDataSet("gisette_train.data")
This loads the feature matrix and creates an unlabeled dataset, since the labels for this dataset are provided separately.
To attach labels to the data we read the labels:
In [3]: data.attachLabels(Labels("gisette_train.labels"))
Note that ''PyML'' has several data containers: ''VectorDataSet'' and ''SparseDataSet'', which have an underlying C++ implementation, and ''PyVectorDataSet'', which uses a NumPy array. We are using ''PyVectorDataSet'' in this case since our perceptron is implemented in pure Python.
Let's find out a few things about the dataset:
In [4]: print data
number of patterns: 6000
number of features: 5000
class Label / Size
-1 : 3000
1 : 3000
This tells us there are 6000 labeled examples in the dataset, and how many there are of each class, as well as the dimensionality of the data (number of features).
You can access this information directly:
In [4]: print len(data), data.numFeatures
6000 5000
The labels are stored in a ''Labels'' object that is associated with the dataset:
In [5]: print data.labels
class Label / Size
-1 : 3000
1 : 3000
The labels themselves are stored as a list in the ''Y'' attribute of the labels object, such that the label associated with the ith training example is ''data.labels.Y[i]''.
As noted above, ''PyVectorDataset'' uses a numpy array to store its data and this array is accessible as the ''X'' attribute of a dataset:
In [6]: print type(data.X), data.X.shape
<type 'numpy.ndarray'> (6000, 5000)
Let's split the dataset into two parts, one for training and one for testing:
In [7]: tr, tst = data.split(0.5)
The argument to split indicates what fraction of the data to use for the first dataset.
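To see what such a split does conceptually, here is a minimal sketch using plain NumPy (a hypothetical stand-in for illustration only; PyML's ''split'' also takes care of details such as keeping the class proportions balanced):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(10, 4)             # 10 examples, 4 features
Y = np.array([-1, 1] * 5)        # alternating class labels

idx = rng.permutation(len(X))    # shuffle the example indices
cut = int(0.5 * len(X))          # the fraction that goes to the first dataset
tr_idx, tst_idx = idx[:cut], idx[cut:]

X_tr, Y_tr = X[tr_idx], Y[tr_idx]      # training half
X_tst, Y_tst = X[tst_idx], Y[tst_idx]  # testing half
print(len(X_tr), len(X_tst))           # 5 5
```

Every example ends up in exactly one of the two halves, which is what makes the second half a fair test set.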
=== Using the classifier ===
Import the perceptron module and create an instance of the classifier:
In [6]: import perceptron
In [7]: p = perceptron.Perceptron()
Every classifier has a ''train'' method that constructs the model based on some training data:
In [8]: p.train(tr)
converged in 10 iterations
Notice that the perceptron has converged. That means the data is linearly separable.
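To make the connection between convergence and separability concrete, here is a minimal perceptron training loop (a sketch, not PyML's implementation; the function name and the toy data are made up). On linearly separable data, a full pass with no mistakes means every training example is on the correct side of the hyperplane, so the loop terminates:

```python
import numpy as np

def train_perceptron(X, Y, max_iter=100):
    # w is the weight vector, b the bias; both start at zero.
    w = np.zeros(X.shape[1])
    b = 0.0
    for it in range(max_iter):
        mistakes = 0
        for x, y in zip(X, Y):
            if y * (np.dot(w, x) + b) <= 0:  # misclassified (or on the boundary)
                w += y * x                   # nudge the hyperplane toward x
                b += y
                mistakes += 1
        if mistakes == 0:                    # a clean pass: converged
            return w, b, it
    return w, b, max_iter

# Toy separable data: the label is the sign of the first feature.
X = np.array([[2.0, 1.0], [1.5, -1.0], [-2.0, 0.5], [-1.0, -0.5]])
Y = np.array([1, 1, -1, -1])
w, b, iters = train_perceptron(X, Y)
print(all(np.sign(np.dot(X, w) + b) == Y))   # True: separates the training set
```

If the data were not separable, the loop would keep making updates and stop only when ''max_iter'' runs out, which is why convergence is evidence of separability.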
Now let's run the classifier on the data we have used for training:
In [9]: results1 = p.test(tr)
In [10]: print results1
Confusion Matrix:
predicted labels:
-1 1
-1 1500 0
1 0 1500
success rate: 1.000000
balanced success rate: 1.000000
area under ROC curve: 1.000000
area under ROC 50 curve: 0.980000
Since the perceptron has converged, it perfectly separates the positive from negative examples, and achieves perfect classification accuracy. Does this mean we have a good classifier? Not necessarily. Let's apply it to our testing data.
In [12]: results = p.test(tst)
In [13]: print results
Confusion Matrix:
predicted labels:
-1 1
-1 1439 61
1 62 1438
success rate: 0.959000
balanced success rate: 0.959000
area under ROC curve: 0.992884
area under ROC 50 curve: 0.852040
The classifier is still doing well, but definitely not perfect!
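The success rates reported above come straight from the confusion matrix. Here is a sketch of the computation, using the test-set counts from the output (the dictionary layout is just for illustration):

```python
# Rows are true labels, columns are predicted labels, as in the output above.
conf = {(-1, -1): 1439, (-1, 1): 61,
        (1, -1): 62,    (1, 1): 1438}

total = sum(conf.values())
correct = conf[(-1, -1)] + conf[(1, 1)]      # diagonal of the matrix
success = float(correct) / total             # plain accuracy

# Balanced success rate: the average of the per-class accuracies.
# It differs from plain accuracy when the classes have unequal sizes.
acc_neg = conf[(-1, -1)] / float(conf[(-1, -1)] + conf[(-1, 1)])
acc_pos = conf[(1, 1)] / float(conf[(1, -1)] + conf[(1, 1)])
balanced = (acc_neg + acc_pos) / 2.0

print(round(success, 3), round(balanced, 3))  # 0.959 0.959
```

Since this dataset has exactly 1500 examples of each class in the test half, the two rates coincide; on skewed data the balanced rate is the more honest number.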