Python - Pandas, Resample dataset to have balanced classes

python pandas numpy machine-learning dataset

A very simple approach, taken from the sklearn documentation and Kaggle: upsample the minority class with replacement until it matches the majority class.

    import pandas as pd
    from sklearn.utils import resample

    # Split by class (label 0 is taken to be the majority, label 1 the minority)
    df_majority = df[df.label == 0]
    df_minority = df[df.label == 1]

    # Upsample minority class
    df_minority_upsampled = resample(df_minority,
                                     replace=True,                # sample with replacement
                                     n_samples=len(df_majority),  # to match majority class
                                     random_state=42)             # reproducible results

    # Combine majority class with upsampled minority class
    df_upsampled = pd.concat([df_majority, df_minority_upsampled])

    # Display new class counts
    df_upsampled.label.value_counts()
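The same idea can be sketched with pandas alone and for any number of classes. This is only an illustration, not the sklearn approach itself: the helper name `upsample_to_majority` and the toy data are made up, and it swaps `sklearn.utils.resample` for `DataFrame.sample(replace=True)`.

```python
import pandas as pd


def upsample_to_majority(df, label_col="label", random_state=42):
    """Resample every smaller class with replacement up to the largest class size.

    Hypothetical helper for illustration; not part of pandas or sklearn.
    """
    target = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        if len(group) < target:
            # Draw `target` rows with replacement from the smaller class.
            group = group.sample(n=target, replace=True, random_state=random_state)
        parts.append(group)
    return pd.concat(parts)


# Toy data (assumed): 2 rows with label 1, 8 rows with label 0.
toy = pd.DataFrame({
    "name": list("AABBBCCCCC"),
    "label": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
})

balanced = upsample_to_majority(toy)
counts = balanced["label"].value_counts()
print(counts)
```

Because rows are drawn with replacement, the upsampled class contains duplicates, so split off any test set before resampling.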


Provided that each name carries exactly one label (e.g. all A are 1), you can use the following:

1. Group the names by label and check which label has an excess (in terms of unique names).
2. Randomly remove names from the over-represented label class in order to account for the excess.
3. Select the part of the data frame which does not contain the removed names.

Here is the code:

    import numpy as np

    labels = df.groupby('label').name.unique()
    # Sort the over-represented class to the head.
    labels = labels[labels.apply(len).sort_values(ascending=False).index]
    excess = len(labels.iloc[0]) - len(labels.iloc[1])
    remove = np.random.choice(labels.iloc[0], excess, replace=False)
    df2 = df[~df.name.isin(remove)]
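To see the three steps end to end, here is a small runnable sketch on made-up data. The frame contents and the seeded `rng` generator are assumptions for illustration; the logic follows the steps described above for two labels.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each name carries exactly one label.
# Label 1 has three unique names (A, B, C); label 0 has two (D, E).
df = pd.DataFrame({
    "name":  ["A", "A", "B", "C", "C", "D", "E", "E"],
    "label": [1, 1, 1, 1, 1, 0, 0, 0],
})

rng = np.random.default_rng(0)  # seeded so the run is repeatable

# Step 1: unique names per label, over-represented label first.
names_per_label = df.groupby("label")["name"].unique()
order = names_per_label.apply(len).sort_values(ascending=False).index
names_per_label = names_per_label.loc[order]

# Step 2: randomly choose as many names as the excess from the larger class.
excess = len(names_per_label.iloc[0]) - len(names_per_label.iloc[1])
remove = rng.choice(names_per_label.iloc[0], size=excess, replace=False)

# Step 3: keep only the rows whose name was not chosen for removal.
df2 = df[~df["name"].isin(remove)]

unique_counts = df2.groupby("label")["name"].nunique()
print(unique_counts)
```

Note that this balances the number of unique names per label, not the number of rows, so row counts can still differ when names repeat unevenly.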


Using imbalanced-learn (pip install imbalanced-learn), this is as simple as:

    from imblearn.under_sampling import RandomUnderSampler

    rus = RandomUnderSampler(sampling_strategy='not minority', random_state=1)
    df_balanced, balanced_labels = rus.fit_resample(df, df['label'])

imbalanced-learn offers many samplers other than RandomUnderSampler, so I suggest you read its documentation.
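If adding a dependency is not an option, random undersampling can also be approximated in plain pandas. This is a rough sketch on assumed toy data, not imbalanced-learn's implementation: it uses `DataFrameGroupBy.sample` (available since pandas 1.1) to draw the same number of rows from every class.

```python
import pandas as pd

# Toy data (assumed): 7 rows with label 0, 3 rows with label 1.
df = pd.DataFrame({
    "feature": range(10),
    "label": [0] * 7 + [1] * 3,
})

# Size of the smallest class; every class is cut down to this many rows.
smallest = df["label"].value_counts().min()

# Draw `smallest` rows without replacement from each label group.
df_balanced = df.groupby("label").sample(n=smallest, random_state=1)

counts = df_balanced["label"].value_counts()
print(counts)
```

Unlike the name-based approach earlier on this page, this samples individual rows, so rows sharing a name may be split between kept and dropped.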
