Cross-validation: sklearn.model_selection, sklearn.cross_validation

Hello everybody,

today I want to describe cross-validation in a bit more detail and show how to work with it in Python. I'll cover how to split a learning set once and how to use different cross-validation strategies.

Before I continue, I'd like to describe the different ways of splitting data for training.

There are the following ways to split data:

1. Split the data 70/30 (sometimes 80/20) into two sets: training data and validation (holdout) data.

You train the model on the training data and validate it on the holdout data.

This approach has the following pros/cons (a small sketch of such a split follows the list):

  • (+) the model is trained only once
  • (-) the result depends on the particular split
  • (+/-) works fine for big data sets
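
Here is a minimal sketch of such a single 70/30 split, assuming hypothetical NumPy arrays X (features) and y (labels) rather than any particular data set:

import numpy as np

# hypothetical data: 100 objects with 4 features each
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

# shuffle the indices once, then slice off 70% for training and 30% for validation
indices = np.random.permutation(len(X))
split = int(0.7 * len(X))
trainIdx, testIdx = indices[:split], indices[split:]

XTrain, yTrain = X[trainIdx], y[trainIdx]
XHoldout, yHoldout = X[testIdx], y[testIdx]
print(len(XTrain), len(XHoldout))  # 70 30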

To understand one more pitfall, take a look at the following picture:

Let's say your task is to predict the price of accommodation, and when you split your data you didn't take into account the special accommodations of the premium segment. By chance those special objects all went into the holdout data set. What will be the quality of your model? Such cases may require a more systematic approach.

Cross validation

With cross-validation, the data set is split into some number of blocks, and each split has two parts: a learning part and a control part. Take a look at this schema:

In this picture you can see what is called 4-fold cross-validation.

Such training has the following pros and cons depending on the number of folds.

For a small number of folds:

  • (+) reliable (low-variance) estimates
  • (-) biased estimates

For a big number of folds:

  • (-) unreliable (high-variance) estimates
  • (+) unbiased estimates

There are no strict recommendations for the number of splits, but the general rule of thumb is: the bigger the data set, the smaller the k you need. Usually k is set to 3, 5 or 10. One more piece of advice for training: it is a good idea to shuffle the data set. That should be done because it sometimes happens that the information in a data file is sorted by some feature, for example by gender, or by price in descending order, etc. To avoid the risk of all the exceptional data ending up in your control set, shuffle the training data set.
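
As a small illustration of that advice, here is a sketch using sklearn.utils.shuffle on hypothetical arrays X and y that arrive sorted by class; the function returns shuffled copies of both arrays, reordered consistently:

import numpy as np
from sklearn.utils import shuffle

# hypothetical data that arrives sorted by class, as it might come from a file
X = np.arange(20).reshape(10, 2)
y = np.array([0]*5 + [1]*5)

# shuffle features and labels together before any splitting
XShuffled, yShuffled = shuffle(X, y, random_state=0)
print(yShuffled)  # the labels are no longer grouped by class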

Sklearn model_selection

Keeping those ideas in mind, let's jump into the Python sklearn library. First of all, let's load the iris data and split it with the help of train_test_split from model_selection. It can be done like this:

from sklearn import datasets, model_selection
import numpy as np
 
iris = datasets.load_iris()
 
# single split of the data into training and test sets with the help of train_test_split
trainData, testData, trainLabels, testLabels = model_selection.train_test_split(iris.data, iris.target, test_size = 0.3)
 
# let's check that the resulting test set is indeed 0.3 of the whole data set
print(float(len(testLabels))/len(iris.data))
 

It will give 0.3 as output.
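
Two optional parameters of train_test_split are worth a mention here: random_state makes the split reproducible, and stratify preserves the class proportions between train and test. A small sketch on the same iris data:

import numpy as np
from sklearn import datasets, model_selection

iris = datasets.load_iris()

# reproducible split that also preserves class proportions across train and test
trainData, testData, trainLabels, testLabels = model_selection.train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target)

# each of the three iris classes keeps roughly the same share in both sets
print(np.bincount(trainLabels), np.bincount(testLabels))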

Next, let's analyze the sizes of the test and training data. You can do it like this:

print('training set size={}, test set size={}'.format(len(trainData), len(testData)))

As output you'll get

training set size=105, test set size=45

Also, in Visual Studio with Python installed, you can look at a slice of the train and test sets like this:

print('training set size={}, test set size={}'.format(len(trainData), len(testData)))
 
print('train data:\n', trainData[:5])
print('\n')
print('test data:\n', testData[:5])

and here is what it will give you:

You can also see what the labels look like with the following code:

# analyze the labels visually:
print('train labels:\n', trainLabels)
print('test labels:\n', testLabels)

Here is what you'll see:

train labels:
 [2 2 2 0 2 1 2 0 1 2 2 1 2 0 1 1 2 1 0 0 0 0 2 1 2 1 2 2 2 2 0 1 1 1 0 0 0
 2 0 1 2 1 1 0 2 1 1 0 0 1 0 2 0 2 2 0 2 1 2 2 1 0 1 0 1 1 0 2 1 2 1 1 2 1
 0 0 0 0 1 1 2 1 2 0 1 2 2 0 0 1 0 2 0 2 0 1 2 0 0 0 1 2 0 1 1]
test labels:
 [1 0 0 2 0 2 0 2 1 1 0 1 0 0 1 0 2 2 1 1 1 1 2 1 1 0 1 0 2 0 2 2 1 2 0 1 2
 2 2 1 0 2 0 0 2]

Such options add some convenience.
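
If you don't want to eyeball raw label arrays, a more compact way to inspect a split is to count how many objects of each class ended up in each set, for example with np.unique; here is a small sketch (repeating the split so the snippet is self-contained):

import numpy as np
from sklearn import datasets, model_selection

iris = datasets.load_iris()
trainData, testData, trainLabels, testLabels = model_selection.train_test_split(
    iris.data, iris.target, test_size=0.3)

# count how many objects of each class ended up in the train and test sets
print('train:', np.unique(trainLabels, return_counts=True))
print('test: ', np.unique(testLabels, return_counts=True))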

Strategies of cross-validation

K-fold cross-validation means splitting the data set into K groups; each group is used once for testing and K-1 times for training. In scikit-learn there is a function for this, KFold. Comparing KFold with train_test_split, it's worth noticing that KFold doesn't split the data set itself. It gives you indices which you can then use to do cross-validation. Take a look at the following code:

from sklearn import cross_validation  # old module used in this post; in modern scikit-learn use model_selection
 
print('Two parameters')
# KFold only needs the number of objects, not the data itself
for trainIndices, testIndices in cross_validation.KFold(10, n_folds=5):
    print(trainIndices, testIndices)
 
print('Three parameters with shuffling')
for trainIndices, testIndices in cross_validation.KFold(10, n_folds=2, shuffle=True):
    print(trainIndices, testIndices)
 
print('Three parameters with shuffling and random state')
for trainIndices, testIndices in cross_validation.KFold(10, n_folds=2, shuffle=True, random_state=1):
    print(trainIndices, testIndices)

As you can see from the code, it doesn't take any chunk of the data set as a parameter. Below is the output:

Two parameters
[2 3 4 5 6 7 8 9] [0 1]
[0 1 4 5 6 7 8 9] [2 3]
[0 1 2 3 6 7 8 9] [4 5]
[0 1 2 3 4 5 8 9] [6 7]
[0 1 2 3 4 5 6 7] [8 9]
Three parameters with shuffling
[1 5 7 8 9] [0 2 3 4 6]
[0 2 3 4 6] [1 5 7 8 9]
Three parameters with shuffling and random state
[1 3 5 7 8] [0 2 4 6 9]
[0 2 4 6 9] [1 3 5 7 8]

As you can see, it outputs only indices, not values from the data set. Keep that in mind. It also means that later in your code you can use those indices for training (see the sketch after the next paragraph).

One more point of attention is the parameter random_state=1. Basically, it makes subsequent calls of the KFold function reproduce the same shuffle rather than shuffling the indices differently every time.
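
To make the point about indices concrete, here is a minimal sketch that uses the fold indices to train and score a classifier on the iris data. It assumes the old sklearn.cross_validation module used throughout this post, and a KNeighborsClassifier picked purely for illustration; in modern scikit-learn the splitter is model_selection.KFold(n_splits=...), and calling its split(X) method yields the same kind of index pairs.

import numpy as np
from sklearn import datasets, cross_validation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
model = KNeighborsClassifier()
scores = []

# the splitter only yields indices; we use them to slice the real data
for trainIndices, testIndices in cross_validation.KFold(len(iris.data), n_folds=5, shuffle=True, random_state=1):
    model.fit(iris.data[trainIndices], iris.target[trainIndices])
    predictions = model.predict(iris.data[testIndices])
    scores.append(accuracy_score(iris.target[testIndices], predictions))

print('accuracy per fold:', scores)
print('mean accuracy:', np.mean(scores))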

StratifiedKFold

StratifiedKFold is very similar to KFold but preserves the initial class distribution of the original data set. Imagine that you have a data set of houses, and in that data set there are 400 flats of one type and 233 of another type. I suppose you'll agree it can be a good idea to preserve that ratio in all the learning sets. How to achieve it? With the help of the StratifiedKFold function. Take a look at the code below:

# StratifiedKFold
print('\n')
target = np.array([0]*5 + [1]*5)  # prepare a test array: 5 zeros followed by 5 ones
print(target)
 
print('StratifiedKFold')
 
for trainIndices, testIndices in cross_validation.StratifiedKFold(target, n_folds=2, shuffle=True, random_state=0):
    print(trainIndices, testIndices)
     

that code gives the following output:

[0 0 0 0 0 1 1 1 1 1]
StratifiedKFold
[3 4 8 9] [0 1 2 5 6 7]
[0 1 2 5 6 7] [3 4 8 9]

Check it yourself:

we have 5 zeros and 5 ones, which means StratifiedKFold should give us a 50/50 division, and that is what we have:

in the first train set, indices 3 and 4 refer to class zero while 8 and 9 refer to class one.

In the second line we have 0, 1, 2 referring to zero and 5, 6, 7 referring to one. Pretty cool and easy, isn't it?

Now let's modify the input array to look a bit different and check how StratifiedKFold works on it. Again, take a look at the code:

target = np.array([0, 1] * 5)
print('target zebra:', target)
 
for trainIndices, testIndices in cross_validation.StratifiedKFold(target, n_folds=2, shuffle=True, random_state=0):
    print(trainIndices, testIndices)

and compare the output with the previous one:

target zebra: [0 1 0 1 0 1 0 1 0 1]
[6 7 8 9] [0 1 2 3 4 5]
[0 1 2 3 4 5] [6 7 8 9]

As you can see, the indices are totally different from before, but the class ratio still remains the same.
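
For reference, sklearn.cross_validation was later removed in favour of sklearn.model_selection; a rough modern equivalent of the StratifiedKFold example looks like this (n_folds becomes n_splits, and the target is passed to the split() method together with the features):

import numpy as np
from sklearn import model_selection

target = np.array([0, 1] * 5)

# modern equivalent: n_splits in the constructor, data and target passed to split()
skf = model_selection.StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for trainIndices, testIndices in skf.split(np.zeros((len(target), 1)), target):
    print(trainIndices, testIndices)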

ShuffleSplit

ShuffleSplit is a function that constructs so-called random permutations, and it doesn't put any limits on how many times an object may appear in the learning or the test part. The function takes as arguments the number of objects, the number of iterations and the size of the test data set.

Take a look at the code below:

print('Shuffle split\n')
for trainIndices, testIndices in cross_validation.ShuffleSplit(10, n_iter=10, test_size=0.2):
    print(trainIndices, testIndices)

It will give you the following output:

[6 9 1 8 5 7 2 4] [0 3]
[0 4 1 6 8 9 7 5] [2 3]
[1 0 2 4 6 8 9 7] [3 5]
[6 9 8 1 2 4 5 3] [7 0]
[5 0 3 8 1 2 6 7] [9 4]
[5 4 9 8 1 3 6 0] [7 2]
[9 3 5 2 1 8 0 4] [7 6]
[2 6 5 8 3 7 0 4] [1 9]
[4 3 8 0 2 5 1 6] [9 7]
[6 9 0 3 2 8 1 4] [7 5]

As you can see, it shuffled your data differently in each of the 10 iterations, and there is no limit on how often an object appears in the test output: the object with index 7 showed up in the test set 5 times, while the object with index 1 appeared only once.
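
In practice such splitters are often not iterated by hand but passed straight to an evaluation helper. Here is a sketch with the modern model_selection API, where ShuffleSplit (n_iter is called n_splits there) is fed into cross_val_score together with a KNeighborsClassifier chosen just for illustration:

from sklearn import datasets, model_selection
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# 10 random 80/20 permutations fed directly into cross_val_score
splitter = model_selection.ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = model_selection.cross_val_score(KNeighborsClassifier(), iris.data, iris.target, cv=splitter)
print(scores)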

StratifiedShuffleSplit

You can stratify a shuffle split as well (stratify means preserving the ratio of object classes).

Take a look at another fragment of code:

# StratifiedShuffleSplit
target = np.array([0]*5 + [1]*5)
print(target)
for trainIndices, testIndices in cross_validation.StratifiedShuffleSplit(target, n_iter=4, test_size=0.2):
    print(trainIndices, testIndices)

It gave the following output on my computer:

[0 0 0 0 0 1 1 1 1 1]
[3 7 6 9 4 1 2 5] [8 0]
[6 4 3 1 9 0 5 7] [2 8]
[5 7 3 9 4 0 1 6] [2 8]
[0 6 7 8 1 9 3 2] [4 5]

Now you can see that in the test set (the second group of brackets) we have one object of class zero and one object of class one each time.

Leave-One-Out

The last cross-validation strategy I'll demonstrate is Leave-One-Out. This strategy leaves exactly one object for testing on each iteration, so the test set contains a single object and every object appears in the test set exactly once. Such a strategy is useful when the data set is small.

Take a look at the code for this function:

print('Leave one out \n')
for trainIndices, testIndex in cross_validation.LeaveOneOut(10):
    print(trainIndices, testIndex)

and here is the output:

[1 2 3 4 5 6 7 8 9] [0]
[0 2 3 4 5 6 7 8 9] [1]
[0 1 3 4 5 6 7 8 9] [2]
[0 1 2 4 5 6 7 8 9] [3]
[0 1 2 3 5 6 7 8 9] [4]
[0 1 2 3 4 6 7 8 9] [5]
[0 1 2 3 4 5 7 8 9] [6]
[0 1 2 3 4 5 6 8 9] [7]
[0 1 2 3 4 5 6 7 9] [8]
[0 1 2 3 4 5 6 7 8] [9]

As you can see from the output, each object appears in the test set exactly once.
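
As a closing note, in the modern module the same splitter is model_selection.LeaveOneOut, which no longer takes the number of objects; a rough sketch of using it together with cross_val_score on iris (the classifier is chosen just for illustration):

import numpy as np
from sklearn import datasets, model_selection
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# one score per object: each object is the test set exactly once
loo = model_selection.LeaveOneOut()
scores = model_selection.cross_val_score(KNeighborsClassifier(), iris.data, iris.target, cv=loo)
print(len(scores), np.mean(scores))  # 150 scores and their mean accuracy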
