
Machine learning project: split training/test sets before or after exploratory data analysis?


To answer this question, we should remind ourselves of why, in machine learning, we split data into training, validation and testing sets (see also this question).

Training sets are used for model development. We often carefully explore this data to get ideas for feature engineering and the general structure of the machine learning model. We then train the model using the training data set.

Usually, our goal is to generate models that will perform well not only on the training data, but also on previously unseen data. Therefore, we want to avoid models that capture the peculiarities of the data we have available now rather than the general structure of the data we will see in the future ("overfitting"). To do so, we assess the quality of the models we're training by evaluating their performance on a different set of data, the validation data, and choose the model that performs best on the validation data.

Having trained our final model, we often want an unbiased estimate of its performance. Since we have already used the validation data in the process of model development (we chose the model that performed best on the validation data), we cannot be sure that our model will perform equally well on unseen data. So, to assess model quality, we test performance using a new, held-out batch of data: the testing data.
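To make the three-way split concrete, here is a minimal sketch using scikit-learn's `train_test_split` called twice; the arrays `X` and `y` are hypothetical stand-ins for your features and labels, and the split ratios are only illustrative.

```python
# Minimal sketch of a train/validation/test split (assumes scikit-learn).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X = rng.normal(size=(1000, 5))    # hypothetical features
y = rng.randint(0, 2, size=1000)  # hypothetical binary labels

# Carve off the test set first, then split the remainder into train/validation.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=0)  # 0.25 * 0.8 = 0.2 of total

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

Carving the test set off first ensures it is never touched during model selection.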

This discussion answers your question: we should not use the testing (or validation) data set for exploratory data analysis. If we did, we would run the risk of overfitting the model to the peculiarities of the data we have, for example by engineering features that happen to work well for the testing data. At the same time, we would lose the ability to get an unbiased estimate of our model's performance.
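As a sketch of this recommended order (split first, then explore only the training portion), assuming pandas and scikit-learn are available; the file name "data.csv" and the column name "target" are placeholders:

```python
# Split first; all exploratory work then happens on the training portion only.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")  # hypothetical dataset
train_df, test_df = train_test_split(df, test_size=0.3, random_state=0)

# EDA uses train_df only; test_df stays untouched until final evaluation.
print(train_df.describe())
print(train_df.corr(numeric_only=True)["target"])  # feature/target correlations
```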


I would take the problem the other way round: is it bad to use the test set?

  • The objective of modeling is to end up with a model with low variance (and small bias): that's why the test set keeps a batch of data aside to assess how your model behaves on new data (i.e., its variance). If you use the test set during modeling, you have nothing left for that assessment, and you are overfitting your data.

  • The objective of EDA is to understand the data you're working with: the distributions of the features, their relationships, their dynamics, etc. If you leave your test set in the data, is there a risk of "overfitting" your understanding of the data? If that were the case, you would observe, on say 70% of your data, properties that do not hold for the remaining 30% (the test set). Since the split is random, that is highly unlikely unless you have been extremely unlucky or the dataset is very small; the sanity check sketched below illustrates this.
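To illustrate that point, here is a quick sanity check, a sketch assuming scikit-learn and a hypothetical numeric feature matrix `X`: compare per-feature summary statistics between the 70% and 30% halves of a random split.

```python
# Sanity check: a random split should preserve feature distributions.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(loc=3.0, scale=2.0, size=(10_000, 4))  # hypothetical features

X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

# With a random split, the per-feature means and stds should agree closely.
print("train means:", X_train.mean(axis=0))
print("test  means:", X_test.mean(axis=0))
print("train stds: ", X_train.std(axis=0))
print("test  stds: ", X_test.std(axis=0))
```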


From my understanding of the machine learning pipeline, exploratory data analysis should be done before splitting the data into train and test sets.

Here are my reasons:

  • The data may not be clean to begin with: it might have missing values, mismatched datatypes, and outliers.
  • We need to understand how every feature relates to the target variable. This helps gauge the importance of each feature with respect to the business problem and helps derive additional features as well.
  • Data visualization also helps extract insights from the dataset.

Once the above operations are done, we can split the dataset into train and test sets, because the distributions must be similar in both train and test; a stratified split, as sketched below, is one common way to keep them similar.
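The following sketch assumes a classification task with scikit-learn; stratifying on the label keeps the class proportions similar in both halves. The feature matrix and labels here are hypothetical.

```python
# Stratified split: preserves the class balance in train and test halves.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))          # hypothetical features
y = (rng.rand(1000) < 0.1).astype(int)  # imbalanced labels (~10% positives)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

print("train positive rate:", y_train.mean())  # ~0.10 in both splits
print("test  positive rate:", y_test.mean())
```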