I was working on a k-nearest neighbours problem set, and I can't understand why they perform K-fold cross-validation on the test set. Can't we directly test how well our best parameter K performs on the entire test data, rather than doing cross-validation?
import numpy as np
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
import sklearn.neighbors

iris = sklearn.datasets.load_iris()
X = iris.data
Y = iris.target
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(
    X, Y, test_size=0.33, random_state=42)

# Tune n_neighbors with 10-fold cross-validation on the training set
k = np.arange(20) + 1
parameters = {'n_neighbors': k}
knn = sklearn.neighbors.KNeighborsClassifier()
clf = sklearn.model_selection.GridSearchCV(knn, parameters, cv=10)
clf.fit(X_train, Y_train)

def computeTestScores(test_x, test_y, clf, cv):
    # Score the already-fitted clf on each of the cv folds of the test set
    kFolds = sklearn.model_selection.KFold(n_splits=cv)
    scores = []
    for _, test_index in kFolds.split(test_x):
        test_data = test_x[test_index]
        test_labels = test_y[test_index]
        scores.append(sklearn.metrics.accuracy_score(test_labels, clf.predict(test_data)))
    return scores

scores = computeTestScores(test_x=X_test, test_y=Y_test, clf=clf, cv=5)
2 Answers
TL;DR
Did you ever have a science teacher who said, 'any measurement without error bounds is meaningless'?
You might worry that the score your fitted, hyperparameter-optimized estimator gets on your test set is a fluke. By scoring it on a number of randomly chosen subsamples of the test set you get a range of scores; you can report their mean, standard deviation, etc. This is, hopefully, a better proxy for how the estimator will perform on new data from the wild.
The following conceptual model may not apply to all estimators, but it is useful to bear in mind. You end up needing three subsets of your data. You can skip to the final paragraph if you are already happy with the points below.
- Training your estimator will fit some internal parameters that you need not ever see directly. You optimize these by training on the training set.
- Most estimators also have hyperparameters (number of neighbours, alpha for Ridge, ...). Hyperparameters also need to be optimized. You need to fit them to a different subset of your data; call it the validation set.
- Finally, when you are happy with the fit of both the estimator's internal parameters and the hyperparameters, you want to see how well the fitted estimator predicts on new data. You need a final subset (the test set) of your data to figure out how well the training and hyperparameter optimization went.
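The three subsets above can be sketched as follows. This is only an illustration, assuming scikit-learn's `model_selection` module and the iris data from the question; the 60/20/20 proportions, the `random_state` values and all variable names are arbitrary choices, not something the problem set prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data as the final test set (proportion is an assumption).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
# Split the remainder into training (60% of all data) and validation (20%).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)

# Fit on the training set, pick the hyperparameter on the validation set.
val_scores = {}
for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_scores[k] = model.score(X_val, y_val)
best_k = max(val_scores, key=val_scores.get)

# Only now touch the test set, once, with the chosen hyperparameter.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)
print(best_k, test_score)
```

The test set plays no part in choosing `best_k`, which is what makes `test_score` a fair estimate.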
In lots of cases, partitioning your data into three means you don't have enough samples in each subset. One way around this is to randomly split the training set a number of times, fit the hyperparameters on each split and aggregate the results. This also helps stop your hyperparameters being over-fit to a particular validation set. K-fold cross-validation is one strategy.
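As a rough sketch of that idea, the loop below scores each candidate `n_neighbors` with 10-fold cross-validation on the training set only and keeps the one with the best mean score. It assumes scikit-learn's `cross_val_score` and reuses the split from the question; it is essentially what `GridSearchCV` does internally, written out by hand for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# For each candidate k, average the accuracy over 10 train/validation splits
# of the training set. The test set is never used here.
mean_cv_score = {}
for k in range(1, 21):
    fold_scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=10)
    mean_cv_score[k] = fold_scores.mean()

best_k = max(mean_cv_score, key=mean_cv_score.get)
print(best_k, mean_cv_score[best_k])
```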
Another use for splitting a data set at random is to get a range of results for how your final estimator did. By splitting the test set and computing the score on each part you get a range of answers to 'how might we do on new data'. The hope is that this is more representative of what you might see as real-world novel data performance. You can also get a standard deviation for your final score. This appears to be what the Harvard cs109 gist is doing.
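A minimal sketch of reporting that range, assuming an already tuned model (here `n_neighbors=5` is just a placeholder, not a tuned value) and a shuffled 5-way split of the test set; the names are illustrative:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Placeholder for the estimator chosen by hyperparameter optimization.
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Score the fitted model on 5 disjoint subsets of the test set.
# The model is NOT refitted; only the "test" half of each split is used.
splitter = KFold(n_splits=5, shuffle=True, random_state=0)
scores = np.array([
    accuracy_score(y_test[idx], model.predict(X_test[idx]))
    for _, idx in splitter.split(X_test)
])

print(f"accuracy: {scores.mean():.3f} +/- {scores.std(ddof=1):.3f}")
```

Because the five subsets here have equal size, the mean of the subset scores is the same as the accuracy on the whole test set; what the split adds is an estimate of the spread.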
6 Comments
_ indicates you are throwing away what would normally be the train_index (see scikit-learn.org/stable/modules/generated/…). You are basically running the final test on different subsets of the test set so you can get an idea of the range of scores likely on real new data.
[…] estimator.fit() call and most computation is done only on estimator.predict(). The algorithm doesn't really have any parameters in the sense of my answer; the whole training set is 'sort of' taken as the param space. k:=n_neighbors is a hyperparameter.
If you make a program that adapts to input, then it will be optimal for the input you adapted it to.
This leads to a problem known as overfitting.
In order to see if you have made a good or a bad model, you need to test it on some other data that is not what you used to make the model. This is why you separate your data into 2 parts.
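To illustrate the point, the sketch below compares the score on the data used to build the model with the score on held-out data. It assumes scikit-learn and the iris data from the question; a 1-nearest-neighbour classifier is used because it essentially memorises its training set, which makes the training score a poor guide to performance on new data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# With n_neighbors=1 every training point is its own nearest neighbour,
# so the model is scored on data it has effectively memorised.
model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # optimistic: same data as fitting
test_acc = model.score(X_test, y_test)     # honest: data the model never saw
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```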