1

I was working on a k-nearest neighbours problem set, and I couldn't understand why they are performing K-fold cross-validation on the test set. Can't we directly test how well our best parameter K performed on the entire test data, rather than doing a cross-validation?

import numpy as np
import sklearn.datasets
import sklearn.neighbors
import sklearn.metrics
import sklearn.cross_validation
import sklearn.grid_search

iris = sklearn.datasets.load_iris()
X = iris.data
Y = iris.target

X_train, X_test, Y_train, Y_test = sklearn.cross_validation.train_test_split(
    X, Y, test_size=0.33, random_state=42)

# Grid search over n_neighbors in 1..20 with 10-fold CV on the training set.
k = np.arange(20) + 1
parameters = {'n_neighbors': k}
knn = sklearn.neighbors.KNeighborsClassifier()
clf = sklearn.grid_search.GridSearchCV(knn, parameters, cv=10)
clf.fit(X_train, Y_train)

# Score the tuned classifier on each of cv folds of the test set.
def computeTestScores(test_x, test_y, clf, cv):
    kFolds = sklearn.cross_validation.KFold(test_x.shape[0], n_folds=cv)
    scores = []
    for _, test_index in kFolds:
        test_data = test_x[test_index]
        test_labels = test_y[test_index]
        scores.append(sklearn.metrics.accuracy_score(test_labels, clf.predict(test_data)))
    return scores

scores = computeTestScores(test_x=X_test, test_y=Y_test, clf=clf, cv=5)
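For reference, the "direct" alternative being asked about could be sketched as a single score on the whole held-out test set (reusing clf, X_test and Y_test defined above):

single_score = clf.score(X_test, Y_test)  # one accuracy value, with no indication of spread
print(single_score)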
asked Jan 22, 2015 at 6:21

2 Answers

3

TL;DR

Did you ever have a science teacher who said, 'Any measurement without error bounds is meaningless'?

You might worry that the score from using your fitted, hyperparameter-optimized estimator on your test set is a fluke. By running a number of tests on randomly chosen subsamples of the test set you get a range of scores; you can report their mean, standard deviation, etc. This is, hopefully, a better proxy for how the estimator will perform on new data from the wild.


The following conceptual model may not apply to all estimators, but it is useful to bear in mind. You end up needing three subsets of your data; a minimal sketch of such a split follows the numbered list. You can skip to the final paragraph if the numbered points are things you are already happy with.

  1. Training your estimator will fit some internal parameters that you need not ever see directly. You optimize these by training on the training set.
  2. Most estimators also have hyperparameters (number of neighbours, alpha for Ridge, ...). Hyperparameters also need to be optimized. You need to fit them to a different subset of your data; call it the validation set.
  3. Finally, when you are happy with the fit of both the estimator's internal parameters and the hyperparameters, you want to see how well the fitted estimator predicts on new data. You need a final subset (the test set) of your data to figure out how well the training and hyperparameter optimization went.
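The sketch below shows one way to carve out the three subsets, using the same (older) sklearn API as the question; the 60/20/20 proportions are an illustrative assumption, not a recommendation:

import sklearn.cross_validation
import sklearn.datasets

iris = sklearn.datasets.load_iris()
X, Y = iris.data, iris.target

# Carve off a test set first, then split the remainder into train and validation.
X_rest, X_test, Y_rest, Y_test = sklearn.cross_validation.train_test_split(
    X, Y, test_size=0.2, random_state=0)
X_train, X_val, Y_train, Y_val = sklearn.cross_validation.train_test_split(
    X_rest, Y_rest, test_size=0.25, random_state=0)  # 0.25 * 0.8 = 0.2 of all data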

In many cases, partitioning your data into three subsets means you don't have enough samples in each one. One way around this is to randomly split the training set a number of times, fit hyperparameters and aggregate the results. This also helps stop your hyperparameters being over-fit to a particular validation set. K-fold cross-validation is one strategy.
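As a rough sketch of that strategy, K-fold cross-validation on the training set can stand in for a fixed validation set when choosing n_neighbors; this mirrors what GridSearchCV(knn, parameters, cv=10) in the question does internally (X_train and Y_train are assumed to come from the question's split):

import sklearn.cross_validation
import sklearn.neighbors

mean_cv_scores = {}
for k in range(1, 21):
    knn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=k)
    # Each of the 10 folds serves once as the validation set for this k.
    fold_scores = sklearn.cross_validation.cross_val_score(knn, X_train, Y_train, cv=10)
    mean_cv_scores[k] = fold_scores.mean()

best_k = max(mean_cv_scores, key=mean_cv_scores.get)  # highest mean CV accuracy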

Another use for splitting a data set at random is to get a range of results for how your final estimator did. By splitting the test set and computing the score you get a range of answers to 'how might we do on new data?'. The hope is that this is more representative of what you might see as real-world novel data performance. You can also get a standard deviation for your final score. This appears to be what the Harvard cs109 gist is doing.
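For example, the per-fold scores returned by the question's computeTestScores can be summarised as a mean plus a spread (a small sketch, assuming the variables from the question are in scope):

import numpy as np

scores = computeTestScores(test_x=X_test, test_y=Y_test, clf=clf, cv=5)
print("test accuracy: %.3f +/- %.3f" % (np.mean(scores), np.std(scores)))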

answered Jan 22, 2015 at 8:36

6 Comments

The _ indicates you are throwing away what would normally be the train_index (see scikit-learn.org/stable/modules/generated/…). You are basically running the final test on different subsets of the test set so you can get an idea of the range of scores likely on real new data.
The parameters are the K nearest neighbours, but what are the hyperparameters in K-nearest neighbours? Is it similar to C in SVM and logistic regression?
KNeighbors doesn't fit the above paradigm very well. There is (almost) nothing done in the estimator.fit() call and most computation is done only in estimator.predict(). The algorithm doesn't really have any parameters in the sense of my answer; the whole training set is 'sort of' taken as the parameter space. k := n_neighbors is a hyperparameter.
Now I get what is meant by hyperparameters. I have done the Coursera machine learning course; how do I further improve my knowledge in machine learning, especially implementing these algorithms in scikit-learn?
Try the videos of the SciPy 2013 tutorial (conference.scipy.org/scipy2013/tutorial_detail.php?id=107); you'll want the git repo here: github.com/jakevdp/sklearn_scipy2013
0

If you make a program that adapts to input, then it will be optimal for the input you adapted it to.

This leads to a problem known as overfitting.

In order to see if you have made a good or a bad model, you need to test it on some other data that is not what you used to make the model. This is why you separate your data into two parts.
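A minimal sketch of that two-part split, assuming the same iris data as in the question (n_neighbors=5 is just an arbitrary illustrative choice):

import sklearn.cross_validation
import sklearn.datasets
import sklearn.neighbors

iris = sklearn.datasets.load_iris()
X, Y = iris.data, iris.target

# Fit on one part, score on the other part, which the model has never seen.
X_train, X_test, Y_train, Y_test = sklearn.cross_validation.train_test_split(
    X, Y, test_size=0.33, random_state=42)
model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=5).fit(X_train, Y_train)
print(model.score(X_test, Y_test))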

answered Jan 22, 2015 at 7:41

2 Comments

But we have already performed 10-fold cross-validation on the training data and then selected the best parameter K. Why should we perform cross-validation again on the testing data?
Because you chose the best K based on the 10-fold validation on the training data, your K is now attuned to any abnormalities in your training data. If this attunement is large, the result from the test set will likely be poor. You can probably see this if you choose an extremely small (one data point, if possible) training set, calculate K, and run it on the rest of the data set as test data.
