
Machine learning project: split training/test sets before or after exploratory data analysis?


To answer this question, we should remind ourselves of why, in machine learning, we split data into training, validation and testing sets (see also this question).

Training sets are used for model development. We often carefully explore this data to get ideas for feature engineering and the general structure of the machine learning model. We then train the model using the training data set.

Usually, our goal is to generate models that will perform well not only on the training data, but also on previously unseen data. Therefore, we want to avoid models that capture the peculiarities of the data we have available now rather than the general structure of the data we will see in the future ("overfitting"). To do so, we assess the quality of the models we're training by evaluating their performance on a different set of data, the validation data, and choose the model that performs best on the validation data.

Having trained our final model, we often want an unbiased estimate of its performance. Since we have already used the validation data in the process of model development (we chose the model that performed best on the validation data), we cannot be sure that our model will perform equally well on unseen data. So, to assess model quality, we test performance using a new, held-out batch of data: the testing data.
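To make the three-way split concrete, here is a minimal sketch using scikit-learn's `train_test_split` called twice; the arrays `X` and `y` are hypothetical stand-ins for your features and labels, and the split ratios are only illustrative.

```python
# Minimal sketch of a train/validation/test split (assumes scikit-learn).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X = rng.normal(size=(1000, 5))    # hypothetical features
y = rng.randint(0, 2, size=1000)  # hypothetical binary labels

# Carve off the test set first, then split the remainder into train/validation.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=0)  # 0.25 * 0.8 = 0.2 of total

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

Carving the test set off first ensures it is never touched during model selection.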

This discussion answers your question: we should not use the testing (or validation) data set for exploratory data analysis. If we did, we would run the risk of overfitting the model to the peculiarities of the data we have, for example by engineering features that happen to work well for the testing data. At the same time, we would lose the ability to get an unbiased estimate of our model's performance.
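As a sketch of this recommended order (split first, then explore only the training portion), assuming pandas and scikit-learn are available; the file name "data.csv" and the column name "target" are placeholders:

```python
# Split first; all exploratory work then happens on the training portion only.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")  # hypothetical dataset
train_df, test_df = train_test_split(df, test_size=0.3, random_state=0)

# EDA uses train_df only; test_df stays untouched until final evaluation.
print(train_df.describe())
print(train_df.corr(numeric_only=True)["target"])  # feature/target correlations
```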


I would take the problem the other way round: is it bad to use the test set?

  • The objective of modeling is to end up with a model with low variance (and small bias): that's why the test set keeps a batch of data aside to assess how your model behaves on new data (i.e., its variance). If you use the test set during modeling, you have nothing left for that assessment, and you are overfitting your data.

  • The objective of EDA is to understand the data you're working with: the distributions of the features, their relationships, their dynamics, etc. If you leave your test set in the data, is there a risk of "overfitting" your understanding of the data? If that were the case, you would observe, on say 70% of your data, properties that do not hold for the remaining 30% (the test set). Since the split is random, that is highly unlikely unless you have been extremely unlucky or the dataset is very small; the sanity check sketched below illustrates this.
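To illustrate that point, here is a quick sanity check, a sketch assuming scikit-learn and a hypothetical numeric feature matrix `X`: compare per-feature summary statistics between the 70% and 30% halves of a random split.

```python
# Sanity check: a random split should preserve feature distributions.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(loc=3.0, scale=2.0, size=(10_000, 4))  # hypothetical features

X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

# With a random split, the per-feature means and stds should agree closely.
print("train means:", X_train.mean(axis=0))
print("test  means:", X_test.mean(axis=0))
print("train stds: ", X_train.std(axis=0))
print("test  stds: ", X_test.std(axis=0))
```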


From my understanding of the machine learning pipeline, exploratory data analysis should be done before splitting the data into train and test sets.

Here are my reasons:

  • The data may not be clean to begin with: it might have missing values, mismatched datatypes, and outliers.
  • We need to understand how every feature relates to the target variable. This helps gauge the importance of each feature with respect to the business problem and helps derive additional features as well.
  • Data visualization also helps extract insights from the dataset.

Once the above operations are done, we can split the dataset into train and test sets, because the distributions must be similar in both train and test; a stratified split, as sketched below, is one common way to keep them similar.
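The following sketch assumes a classification task with scikit-learn; stratifying on the label keeps the class proportions similar in both halves. The feature matrix and labels here are hypothetical.

```python
# Stratified split: preserves the class balance in train and test halves.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))          # hypothetical features
y = (rng.rand(1000) < 0.1).astype(int)  # imbalanced labels (~10% positives)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

print("train positive rate:", y_train.mean())  # ~0.10 in both splits
print("test  positive rate:", y_test.mean())
```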