Model selection and cross-validation in scikit-learn

First let's import some modules and read in some data:

In [1]: import numpy as np
In [2]: from sklearn import cross_validation
In [3]: from sklearn import svm
In [4]: from sklearn import metrics
In [5]: data=np.genfromtxt("../data/heart_scale.data", delimiter=",")
In [6]: X=data[:,1:]
In [7]: y=data[:,0]
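
The slicing above relies on the file storing the class label in the first column and the features in the remaining ones. A minimal sketch of that layout, using a made-up three-row sample in place of the real file (the values are hypothetical and only illustrate the column convention):

```python
import io

import numpy as np

# Hypothetical sample in the assumed layout: label first, then the features.
sample = io.StringIO("1,0.7,-0.2,0.1\n-1,0.3,0.5,-0.4\n1,-0.1,0.9,0.6\n")
data = np.genfromtxt(sample, delimiter=",")

X = data[:, 1:]  # every column except the first: the feature matrix
y = data[:, 0]   # first column: the class labels
print(X.shape, y)
```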

The simplest form of model evaluation holds out part of the data as a validation/test set:

In [9]: X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.4, random_state=0)
In [10]: classifier = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
In [11]: classifier.score(X_test, y_test)
Out[11]: 0.7592592592592593
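
The session above uses the `sklearn.cross_validation` module, which was removed in later scikit-learn releases; its functions now live in `sklearn.model_selection`. Below is a sketch of the same hold-out evaluation with the newer import path. The synthetic data from `make_classification` is an assumption standing in for the heart data file, so the score will not match the one shown above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: synthetic stand-in for the heart data (270 examples, 13 features).
X, y = make_classification(n_samples=270, n_features=13, random_state=0)

# Hold out 40% of the examples for testing, as in the session above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Fit on the training part only, then measure accuracy on the held-out part.
classifier = SVC(kernel="linear", C=1).fit(X_train, y_train)
test_accuracy = classifier.score(X_test, y_test)
print(test_accuracy)
```

Fixing `random_state` keeps the split reproducible, so repeated runs score the same held-out examples.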

Next, let's perform cross-validation:

In [18]: cross_validation.cross_val_score(classifier, X, y, cv=5, scoring='accuracy')
Out[18]: array([ 0.7962963 ,  0.83333333,  0.88888889,  0.83333333,  0.83333333])
In [19]: 
In [19]: # you can obtain scores for other metrics, such as the area under the ROC curve:
In [20]: cross_validation.cross_val_score(classifier, X, y, cv=5, scoring='roc_auc')
Out[20]: array([ 0.89166667,  0.89166667,  0.95833333,  0.87638889,  0.91388889])
In [21]: 
In [21]: # you can also obtain the predictions by cross-validation and then compute the accuracy:
In [22]: y_predict = cross_validation.cross_val_predict(classifier, X, y, cv=5)
In [23]: metrics.accuracy_score(y, y_predict)
Out[23]: 0.83703703703703702
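
For readers on a newer scikit-learn, the same three steps can be sketched with `sklearn.model_selection`. The synthetic data is again an assumption replacing the heart data, so the numbers will differ from the session above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.svm import SVC

# Assumption: synthetic stand-in for the heart data (270 examples, 13 features).
X, y = make_classification(n_samples=270, n_features=13, random_state=0)
classifier = SVC(kernel="linear", C=1)

# One score per fold; the estimator is refit on each training split.
accuracy_per_fold = cross_val_score(classifier, X, y, cv=5, scoring="accuracy")
auc_per_fold = cross_val_score(classifier, X, y, cv=5, scoring="roc_auc")

# Out-of-fold predictions: each example is predicted by a model that
# was trained without it.
y_predict = cross_val_predict(classifier, X, y, cv=5)
pooled_accuracy = accuracy_score(y, y_predict)

print(accuracy_per_fold.mean(), auc_per_fold.mean(), pooled_accuracy)
```

Note that `cross_val_score` returns one value per fold, while the accuracy computed from `cross_val_predict` pools all examples into a single number. In the session above the folds have equal size, which is why the mean of the five fold accuracies equals the pooled value of about 0.837.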

Here's an alternative way of doing cross-validation:

In [25]: # first divide the data into folds:
In [26]: cv = cross_validation.StratifiedKFold(y, 5)
In [27]: # now use these folds:
In [28]: print(cross_validation.cross_val_score(classifier, X, y, cv=cv, scoring='roc_auc'))
[ 0.89166667  0.89166667  0.95833333  0.87638889  0.91388889]
In [29]: 
In [29]: # you can see how examples were divided into folds by looking at the test_folds attribute:
In [30]: print(cv.test_folds)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 3 3 2 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4
 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 4 4 4 4 4 4 4 4 4 4 4]
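
In the newer `sklearn.model_selection` API, `StratifiedKFold` takes the number of folds at construction time and receives the labels later through `split()`. The sketch below assumes that API and rebuilds a per-example fold index from the splits rather than reading a `test_folds` attribute. The synthetic data and the helper array name `test_folds` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Assumption: synthetic stand-in for the heart data (270 examples, 13 features).
X, y = make_classification(n_samples=270, n_features=13, random_state=0)
classifier = SVC(kernel="linear", C=1)

# Configure the splitter first; the labels are supplied when splitting.
cv = StratifiedKFold(n_splits=5)
print(cross_val_score(classifier, X, y, cv=cv, scoring="roc_auc"))

# Record, for every example, the index of the fold in which it is tested.
test_folds = np.empty(len(y), dtype=int)
for fold_index, (train_idx, test_idx) in enumerate(cv.split(X, y)):
    test_folds[test_idx] = fold_index
print(test_folds)
```

Because the folds are stratified, each one keeps roughly the same class proportions as the full label vector, which can be checked by counting the labels in `y[test_folds == k]` for each fold `k`.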
