
Scikit-learn train_test_split with indices


You can use pandas DataFrames or Series as Julien said, but if you want to restrict yourself to NumPy you can pass an additional array of indices:

from sklearn.model_selection import train_test_split
import numpy as np

n_samples, n_features, n_classes = 10, 2, 2
data = np.random.randn(n_samples, n_features)  # 10 training examples
labels = np.random.randint(n_classes, size=n_samples)  # 10 labels
indices = np.arange(n_samples)

(
    data_train,
    data_test,
    labels_train,
    labels_test,
    indices_train,
    indices_test,
) = train_test_split(data, labels, indices, test_size=0.2)
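
Because train_test_split applies the same shuffle to every array you pass it, the returned index arrays map each split back to the original rows. A minimal sanity check, assuming the variables defined above:

# indices_test selects exactly the rows that ended up in the test split
assert np.array_equal(data[indices_test], data_test)
assert np.array_equal(labels[indices_test], labels_test)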


Scikit-learn plays really well with pandas, so I suggest you use it. Here's an example:

In [1]: import pandas as pd
        import numpy as np
        from sklearn.model_selection import train_test_split

        data = np.reshape(np.random.randn(20), (10, 2))  # 10 training examples
        labels = np.random.randint(2, size=10)            # 10 labels

In [2]: # Giving columns in X a name
        X = pd.DataFrame(data, columns=['Column_1', 'Column_2'])
        y = pd.Series(labels)

In [3]: X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                            test_size=0.2,
                                                            random_state=0)

In [4]: X_test
Out[4]:
   Column_1  Column_2
2     -1.39     -1.86
8      0.48     -0.81
4     -0.10     -1.83

In [5]: y_test
Out[5]:
2    1
8    1
4    1
dtype: int32

You can directly call any scikit-learn function on a DataFrame/Series and it will work.
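
In particular, the DataFrame and Series keep their original index through the split, so the indices you were after are simply the index of the test objects. A small sketch, reusing X_test from the session above:

# The original row positions of the test set are preserved on the index
test_indices = X_test.index          # e.g. [2, 8, 4] for the split shown above
print(test_indices.tolist())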

Let's say you wanted to do a LogisticRegression; here's how you could retrieve the coefficients in a nice way:

In [6]: from sklearn.linear_model import LogisticRegression

        model = LogisticRegression()
        model = model.fit(X_train, y_train)

        # Retrieve coefficients: index is the feature name (['Column_1', 'Column_2'] here)
        df_coefs = pd.DataFrame(model.coef_[0], index=X.columns, columns=['Coefficient'])
        df_coefs
Out[6]:
          Coefficient
Column_1     0.076987
Column_2    -0.352463
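
The same trick works for predictions. A small sketch, reusing model and X_test from above, that keeps each prediction aligned with its original row label by wrapping the output in a Series indexed like X_test:

# Predictions carrying the original row labels (2, 8, 4 here)
preds = pd.Series(model.predict(X_test), index=X_test.index, name='prediction')
preds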


Here's the simplest solution (Jibwa made it seem complicated in another answer), without having to generate indices yourself: just use the ShuffleSplit object to generate a single split.

import numpy as np
from sklearn.model_selection import ShuffleSplit  # or StratifiedShuffleSplit

sss = ShuffleSplit(n_splits=1, test_size=0.1)

data_size = 100
X = np.reshape(np.random.rand(data_size * 2), (data_size, 2))
y = np.random.randint(2, size=data_size)

sss.get_n_splits(X, y)
train_index, test_index = next(sss.split(X, y))

X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
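
If your classes are imbalanced, a variant sketch (assuming the X and y arrays above) is to swap in StratifiedShuffleSplit, which follows the same pattern but preserves the class proportions of y in both splits:

from sklearn.model_selection import StratifiedShuffleSplit

# Same pattern as above, but each split keeps the class balance of y
strat = StratifiedShuffleSplit(n_splits=1, test_size=0.1)
train_index, test_index = next(strat.split(X, y))
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]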