code:model_selection [CS545 fall 2016]


Differences

This shows you the differences between two versions of the page.

code:model_selection [2015/10/05 13:25]
asa created
code:model_selection [2015/10/05 15:01]
asa
Line 20: Line 20:
  
 </code> </code>
 +
 +The simplest form of model evaluation uses a validation/test set:
 +
 +<code python>
 +In [9]: X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.4, random_state=0)
 +
 +In [10]: classifier = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
 +
 +In [11]: classifier.score(X_test, y_test)
 +Out[11]: 0.7592592592592593
 +
 +</code>
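To make the hold-out idea concrete, here is a minimal pure-Python sketch of what a train/test split does: shuffle the example indices with a fixed seed and set aside a fraction for testing. The helper `holdout_split` and the toy data are hypothetical names made up for illustration; this is not how scikit-learn implements `train_test_split`. (The transcript above uses the older `sklearn.cross_validation` module; later scikit-learn releases moved these functions to `sklearn.model_selection`.)

```python
import random

def holdout_split(X, y, test_size=0.4, seed=0):
    """Shuffle the example indices once, then hold out a fraction for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)      # a fixed seed gives a reproducible split
    n_test = int(round(test_size * len(X)))   # number of held-out examples
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# toy data (an assumption for illustration): ten one-feature examples, alternating labels
X_toy = [[i] for i in range(10)]
y_toy = [i % 2 for i in range(10)]
X_train, X_test, y_train, y_test = holdout_split(X_toy, y_toy, test_size=0.4, seed=0)
print(len(X_train), len(X_test))
```

The classifier is then fit on the training part only and scored on the held-out part, as in the transcript.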
 +
 +Next, let's perform cross-validation:
 +
 +<code python>
 +
 +In [18]: cross_validation.cross_val_score(classifier, X, y, cv=5, scoring='accuracy')
 +Out[18]: array([ 0.7962963 ,  0.83333333,  0.88888889,  0.83333333,  0.83333333])
 +
 +In [19]: # you can obtain scores for other metrics, such as area under the ROC curve:
 +
 +In [20]: cross_validation.cross_val_score(classifier, X, y, cv=5, scoring='roc_auc')
 +Out[20]: array([ 0.89166667,  0.89166667,  0.95833333,  0.87638889,  0.91388889])
 +
 +In [21]: # you can also obtain the predictions by cross-validation and then compute the accuracy:
 +
 +In [22]: y_predict = cross_validation.cross_val_predict(classifier, X, y, cv=5)
 +
 +In [23]: metrics.accuracy_score(y, y_predict)
 +Out[23]: 0.83703703703703702
 +
 +</code>
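The two routes above (per-fold scores versus pooled cross-validated predictions) can be sketched by hand. The code below is an illustration only: the toy 1-nearest-neighbor classifier, the contiguous fold layout, and all function names are assumptions, not scikit-learn's implementation. It shows that each example is predicted exactly once, by a model that never saw it during training.

```python
def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def nearest_neighbor_predict(X_train, y_train, x):
    """Toy 1-nearest-neighbor classifier on scalar features."""
    best = min(range(len(X_train)), key=lambda i: abs(X_train[i] - x))
    return y_train[best]

def cross_val_predict_sketch(X, y, k):
    """Return out-of-fold predictions and the accuracy of each fold."""
    predictions = [None] * len(X)
    fold_scores = []
    for test_idx in kfold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        correct = 0
        for i in test_idx:
            predictions[i] = nearest_neighbor_predict(X_tr, y_tr, X[i])
            correct += predictions[i] == y[i]
        fold_scores.append(correct / len(test_idx))
    return predictions, fold_scores

# toy data (an assumption): two well-separated classes on a line
X_cv = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2, 0.3, 1.3, 0.4, 1.4]
y_cv = [0, 0, 0, 1, 1, 1, 0, 1, 0, 1]
predictions, fold_scores = cross_val_predict_sketch(X_cv, y_cv, k=5)
pooled_accuracy = sum(p == t for p, t in zip(predictions, y_cv)) / len(y_cv)
mean_fold_accuracy = sum(fold_scores) / len(fold_scores)
print(fold_scores, pooled_accuracy, mean_fold_accuracy)
```

When all folds have the same size, the pooled accuracy equals the mean of the per-fold accuracies, which is consistent with the transcript: the five scores in `Out[18]` average to the value in `Out[23]`.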
 +
 +Here's an alternative way of doing cross-validation.
 +
 +<code python>
 +In [25]: # first divide the data into folds:
 +
 +In [26]: cv = cross_validation.StratifiedKFold(y, 5)
 +
 +In [27]: # now use these folds:
 +
 +In [28]: print cross_validation.cross_val_score(classifier, X, y, cv=cv, scoring='roc_auc')
 +[ 0.89166667  0.89166667  0.95833333  0.87638889  0.91388889]
 +
 +In [29]: # you can see how examples were divided into folds by looking at the test_folds attribute:
 +
 +In [30]: print cv.test_folds
 +[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 + 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 + 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2
 + 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 + 2 2 2 2 2 2 2 2 2 2 2 2 3 3 2 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 + 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4
 + 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 + 4 4 4 4 4 4 4 4 4 4 4]
 +
 +In [31]: # hmm... perhaps we should shuffle things a bit...
 +
 +In [32]: cv = cross_validation.StratifiedKFold(y, 5, shuffle=True)
 +
 +In [33]: print cv.test_folds
 +[0 1 1 2 0 1 4 3 4 3 2 0 2 3 2 3 2 0 4 1 1 3 4 1 1 4 1 4 4 2 2 3 0 2 3 1 4
 + 0 3 2 0 2 0 1 3 2 0 0 2 3 0 4 2 0 4 3 4 1 1 0 3 2 4 3 2 3 1 1 1 1 4 3 1 1
 + 4 2 2 3 3 1 4 2 1 0 2 1 0 2 4 1 0 3 2 3 1 2 2 1 1 0 4 1 3 0 1 1 3 3 0 3 3
 + 4 2 0 2 0 2 4 0 1 0 4 4 1 1 0 4 0 1 4 4 3 1 3 3 2 4 3 4 2 4 3 4 1 4 2 0 3
 + 3 3 3 0 0 0 4 3 4 2 3 0 1 1 0 0 4 0 4 1 4 0 0 0 0 3 3 0 4 4 2 0 3 3 0 1 2
 + 2 2 3 2 1 3 4 4 4 1 1 4 2 1 0 3 1 2 0 0 0 0 2 3 4 3 2 0 0 4 1 3 2 2 0 1 2
 + 4 2 4 0 2 1 1 0 4 4 1 4 4 3 4 2 3 3 1 4 2 1 4 1 3 2 1 3 2 1 3 1 3 0 2 2 0
 + 4 4 2 2 4 3 3 0 2 0 2]
 +
 +In [34]: # each time you re-run the shuffled division into folds you get a different assignment:
 +
 +In [35]: cv = cross_validation.StratifiedKFold(y, 5, shuffle=True)
 +
 +In [36]: print cv.test_folds
 +[3 0 2 2 0 2 2 4 1 4 0 2 3 4 2 0 4 0 3 3 4 0 2 0 4 4 0 1 4 4 3 4 1 2 3 3 1
 + 2 1 4 4 4 0 0 4 2 0 0 2 0 1 3 1 0 3 4 0 3 0 4 1 1 2 4 2 0 2 3 1 0 3 0 1 2
 + 3 2 4 0 0 0 1 4 3 2 2 4 3 1 3 2 0 2 0 0 3 2 1 2 4 4 0 0 4 2 1 4 3 0 4 3 4
 + 1 4 0 0 4 2 1 4 4 3 4 1 1 3 0 2 2 3 1 2 3 1 0 4 1 4 1 3 1 3 3 4 4 1 0 0 0
 + 0 4 3 1 2 2 3 0 3 2 4 3 2 2 3 0 3 1 0 4 2 3 0 2 4 3 0 4 3 4 3 3 0 3 1 2 2
 + 1 3 4 1 0 4 3 4 0 0 0 3 2 2 1 3 4 4 2 3 4 3 2 1 3 0 4 0 1 3 1 2 2 2 2 0 3
 + 1 1 1 2 0 1 4 1 1 1 2 2 1 2 3 3 1 4 4 3 4 2 0 2 2 1 1 1 2 0 3 0 2 1 1 3 1
 + 3 1 0 1 3 4 4 2 1 1 1]
 +
 +In [37]: # if you want to consistently get the same division into folds:
 +
 +In [38]: cv = cross_validation.StratifiedKFold(y, 5, shuffle=True, random_state=0)
 +
 +In [39]: # this sets the seed for the random number generator.
 +
 +</code>
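The effect of stratification, shuffling, and a fixed seed can be reproduced with a short pure-Python sketch. The function `stratified_test_folds` below is a hypothetical helper written for illustration (it is not scikit-learn's algorithm): within each class it optionally shuffles the members and then cuts them into contiguous chunks, one per fold, so every fold keeps roughly the class proportions of the full data set.

```python
import random
from collections import Counter, defaultdict

def stratified_test_folds(y, n_folds, shuffle=False, seed=None):
    """Return a list giving the test-fold number assigned to each example."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    rng = random.Random(seed)          # seed=None gives a different shuffle on each run
    test_folds = [None] * len(y)
    for label in sorted(by_class):
        members = by_class[label]
        if shuffle:
            rng.shuffle(members)
        # cut this class into n_folds contiguous chunks of near-equal size
        for position, i in enumerate(members):
            test_folds[i] = position * n_folds // len(members)
    return test_folds

# toy labels (an assumption): six examples of class 0 followed by four of class 1
labels = [0] * 6 + [1] * 4
plain = stratified_test_folds(labels, 2)
shuffled = stratified_test_folds(labels, 2, shuffle=True, seed=0)
print(plain)
print(shuffled)
```

Without shuffling, the assignment follows the order of the data, which is why the first `test_folds` printout in the transcript shows long runs of the same fold number. With shuffling, the per-class counts in each fold stay the same but the membership changes, and fixing the seed makes that membership reproducible.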
 +
 +Let's do grid search for the optimal set of parameters:
 +
 +<code python>
 +In [40]: from sklearn.grid_search import GridSearchCV
 +
 +In [41]: Cs = np.logspace(-2, 3, 6)
 +
 +In [42]: classifier = GridSearchCV(estimator=svm.LinearSVC(), param_grid=dict(C=Cs))
 +
 +In [43]: classifier.fit(X, y)
 +Out[43]:
 +GridSearchCV(cv=None, error_score='raise',
 +       estimator=LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
 +     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
 +     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
 +     verbose=0),
 +       fit_params={}, iid=True, loss_func=None, n_jobs=1,
 +       param_grid={'C': array([  1.00000e-02,   1.00000e-01,   1.00000e+00,   1.00000e+01,
 +         1.00000e+02,   1.00000e+03])},
 +       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
 +       verbose=0)
 +
 +In [44]: # print the best accuracy, classifier and parameters:
 +
 +In [45]: print classifier.best_score_
 +0.844444444444
 +
 +In [46]: print classifier.best_estimator_
 +LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
 +     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
 +     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
 +     verbose=0)
 +
 +In [47]: print classifier.best_params_
 +{'C': 1.0}
 +
 +In [48]: # performing nested cross-validation:
 +
 +In [49]: print cross_validation.cross_val_score(classifier, X, y, cv=5)
 +[ 0.7962963   0.81481481  0.88888889  0.83333333  0.83333333]
 +
 +In [50]: # if we want to do grid search over multiple parameters:
 +
 +In [51]: param_grid = [
 +   ....:   {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
 +   ....:   {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
 +   ....:  ]
 +
 +In [52]: classifier = GridSearchCV(estimator=svm.SVC(), param_grid=param_grid)
 +
 +In [53]: print cross_validation.cross_val_score(classifier, X, y, cv=5)
 +[ 0.7962963   0.83333333  0.88888889  0.7962963   0.87037037]
 +
 +</code>
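Passing a grid-search object to a cross-validation routine, as done above, amounts to nested cross-validation: an inner loop picks the parameter using only the outer training data, and the outer loop scores that choice on data the search never saw. The sketch below spells this out in pure Python. Everything in it is an assumption made for illustration (a toy k-nearest-neighbor classifier, folds assigned by `i % n_folds`, and the function names); it is not scikit-learn's `GridSearchCV`.

```python
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    """Majority vote among the k nearest training points (scalar features)."""
    nearest = sorted(range(len(X_train)), key=lambda i: abs(X_train[i] - x))[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

def fold_split(n, n_folds, fold):
    """Example i belongs to test fold i % n_folds."""
    test_idx = [i for i in range(n) if i % n_folds == fold]
    train_idx = [i for i in range(n) if i % n_folds != fold]
    return train_idx, test_idx

def fold_accuracy(X, y, train_idx, test_idx, k):
    X_tr = [X[i] for i in train_idx]
    y_tr = [y[i] for i in train_idx]
    hits = sum(knn_predict(X_tr, y_tr, X[i], k) == y[i] for i in test_idx)
    return hits / len(test_idx)

def cv_accuracy(X, y, k, n_folds):
    """Mean cross-validated accuracy for one parameter value."""
    scores = [fold_accuracy(X, y, *fold_split(len(X), n_folds, f), k)
              for f in range(n_folds)]
    return sum(scores) / len(scores)

def grid_search(X, y, grid, n_folds):
    """Return the grid value with the best cross-validated accuracy."""
    return max(grid, key=lambda k: cv_accuracy(X, y, k, n_folds))

def nested_cv(X, y, grid, outer_folds, inner_folds):
    """Outer loop estimates accuracy; inner loop tunes k on the outer training data only."""
    outer_scores = []
    for f in range(outer_folds):
        train_idx, test_idx = fold_split(len(X), outer_folds, f)
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        best_k = grid_search(X_tr, y_tr, grid, inner_folds)
        outer_scores.append(fold_accuracy(X, y, train_idx, test_idx, best_k))
    return outer_scores

# toy data (an assumption): twenty points on a line, class changes at the midpoint
X_nested = [0.1 * i for i in range(20)]
y_nested = [0] * 10 + [1] * 10
k_grid = [1, 3, 5]
outer_scores = nested_cv(X_nested, y_nested, k_grid, outer_folds=5, inner_folds=4)
print(outer_scores)
```

The design point is that `grid_search` is called inside the outer loop, so the held-out outer fold plays no part in choosing the parameter. The accuracy reported by a grid search fitted on all the data (such as `best_score_` above) does not have that property, which is the reason for nesting.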
 +
code/model_selection.txt · Last modified: 2016/10/06 14:58 by asa