How to split/partition a dataset into training and test datasets for, e.g., cross validation?



If you want to split the data set once into two parts, you can use numpy.random.shuffle, or numpy.random.permutation if you need to keep track of the indices (remember to fix the random seed to make everything reproducible):

import numpy

# x is your dataset
x = numpy.random.rand(100, 5)
numpy.random.shuffle(x)
training, test = x[:80, :], x[80:, :]

or

import numpy

# x is your dataset
x = numpy.random.rand(100, 5)
indices = numpy.random.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx, :], x[test_idx, :]
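As mentioned above, you should fix the random seed if you want the split to be reproducible. A minimal sketch of the index-based split using numpy's Generator API (the seed value 42 is arbitrary):

import numpy

rng = numpy.random.default_rng(42)      # fixed seed for a reproducible split
x = numpy.random.rand(100, 5)           # x is your dataset
indices = rng.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx, :], x[test_idx, :]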

There are many other ways to repeatedly partition the same data set for cross validation. Many of those are available in the sklearn library (k-fold, leave-one-out, ...). sklearn also includes more advanced "stratified sampling" methods that create a partition of the data that is balanced with respect to some features, for example to make sure that there is the same proportion of positive and negative examples in the training and test set.
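As a rough sketch of the plain and stratified k-fold splitters mentioned above (the data and labels here are dummies, made up for illustration):

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.random.rand(100, 5)              # dummy features
y = np.random.randint(0, 2, size=100)   # dummy binary labels

# plain k-fold: 5 different train/test partitions of the same data
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    X_train, X_test = X[train_idx], X[test_idx]

# stratified k-fold: each fold preserves the class proportions of y
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]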


There is another option that just entails using scikit-learn. As the scikit-learn documentation describes, you can simply use train_test_split:

import numpy as np
from sklearn.model_selection import train_test_split

data, labels = np.arange(10).reshape((5, 2)), range(5)
data_train, data_test, labels_train, labels_test = train_test_split(
    data, labels, test_size=0.20, random_state=42)

This way the labels stay in sync with the data you're splitting into training and test sets.


Just a note. In case you want train, test, AND validation sets, you can do this:

from sklearn.model_selection import train_test_split

X = get_my_X()
y = get_my_y()

# first carve out 30% for test + validation, then split that part in half
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5)

These parameters give 70% of the data to the training set and 15% each to the test and validation sets. Hope this helps.
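To see where the 70/15/15 split comes from: test_size=0.3 leaves 70% for training, and splitting that remaining 30% in half gives 15% each for test and validation. A quick check on dummy data (shapes only, assuming 1000 samples):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)             # dummy features
y = np.random.randint(0, 2, size=1000)  # dummy labels

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5)

print(x_train.shape[0], x_test.shape[0], x_val.shape[0])  # 700 150 150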