When using feature selection, you need to be very careful about how you evaluate your classifier.
Here's the wrong way of doing it:
    from PyML import *

    # the wrong way of using feature selection
    data = SparseDataSet('colon.data')
    # distinguish between normal tissue and tissue affected by colon cancer
    # data is available from:
    # http://mldata.org/repository/data/viewslug/colon-cancer/

    # create an instance of the RFE feature selection method
    rfe = featsel.RFE()
    # a feature selector's train method selects a subset of features
    rfe.train(data)
    results1 = SVM().stratifiedCV(data)
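If you run this, you will get a classifier with perfect accuracy. That number is misleading: rfe.train(data) chose the features using every example, including the ones that are later held out as test folds, so the cross-validation estimate is optimistically biased. Now let's do it the right way: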
    # the right way to perform feature selection:
    # feature selection is performed as part of training the classifier
    data = SparseDataSet('colon.data')
    results2 = composite.FeatureSelect(SVM(), featsel.RFE()).stratifiedCV(data)
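To see why the placement of feature selection matters, here is a minimal pure-Python sketch that does not use PyML at all. It is an illustration under stated assumptions, not PyML's implementation: the data are random noise with labels that carry no signal, a simple mean-difference filter stands in for RFE, and a nearest-centroid rule stands in for the SVM. All function names (make_noise_data, class_means, select_features, cross_validate) are hypothetical.

```python
import random

def make_noise_data(n_samples=40, n_features=1000, seed=0):
    """Gaussian noise features with balanced labels unrelated to the features."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y

def class_means(X, y, rows, feats):
    """Per-class mean of each feature in feats, computed over the given rows."""
    means = {}
    for label in (0, 1):
        members = [i for i in rows if y[i] == label]
        means[label] = [sum(X[i][f] for i in members) / len(members)
                        for f in feats]
    return means

def select_features(X, y, rows, k):
    """Stand-in for RFE: keep the k features whose class means differ most."""
    feats = list(range(len(X[0])))
    m = class_means(X, y, rows, feats)
    ranked = sorted(feats, key=lambda f: abs(m[0][f] - m[1][f]), reverse=True)
    return ranked[:k]

def cross_validate(X, y, k, n_folds=5, select_inside=True):
    """Cross-validated accuracy of a nearest-centroid classifier.

    select_inside=False selects features once on all samples (the wrong way);
    select_inside=True selects them on each training fold only (the right way).
    """
    rows = list(range(len(X)))
    # Wrong way: features are chosen using every sample, test folds included.
    leaked = None if select_inside else select_features(X, y, rows, k)
    correct = 0
    for fold in range(n_folds):
        test = [i for i in rows if i % n_folds == fold]
        train = [i for i in rows if i % n_folds != fold]
        feats = select_features(X, y, train, k) if select_inside else leaked
        means = class_means(X, y, train, feats)
        for i in test:
            # Squared distance from sample i to each class centroid.
            dist = {label: sum((X[i][f] - c) ** 2
                               for f, c in zip(feats, means[label]))
                    for label in (0, 1)}
            predicted = 0 if dist[0] <= dist[1] else 1
            correct += predicted == y[i]
    return correct / len(X)

if __name__ == "__main__":
    X, y = make_noise_data()
    print("selection outside CV:", cross_validate(X, y, k=10, select_inside=False))
    print("selection inside CV: ", cross_validate(X, y, k=10, select_inside=True))
```

Because the labels are unrelated to the features, an honest estimate should sit near chance. Selecting features outside the cross-validation loop typically reports a much higher accuracy on the same noise, which is the same effect that inflates results1 above.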