
Remove Outliers in Pandas DataFrame using Percentiles


Use this code and don't waste your time:

Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
df = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]

In case you want to check specific columns only:

cols = ['col_1', 'col_2']  # one or more
Q1 = df[cols].quantile(0.25)
Q3 = df[cols].quantile(0.75)
IQR = Q3 - Q1
df = df[~((df[cols] < (Q1 - 1.5 * IQR)) |
          (df[cols] > (Q3 + 1.5 * IQR))).any(axis=1)]
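As a quick sanity check, the IQR rule above can be exercised on a small toy DataFrame (the column names and values below are made up for illustration; the single obvious outlier is 500 in `col_1`):

```python
import pandas as pd

# Toy data: 500 in col_1 is the only outlier
df = pd.DataFrame({'col_1': [10, 12, 11, 13, 500, 12],
                   'col_2': [1, 2, 1, 3, 2, 2]})

cols = ['col_1', 'col_2']
Q1 = df[cols].quantile(0.25)
Q3 = df[cols].quantile(0.75)
IQR = Q3 - Q1
# Drop any row where at least one of the chosen columns
# falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
df = df[~((df[cols] < (Q1 - 1.5 * IQR)) |
          (df[cols] > (Q3 + 1.5 * IQR))).any(axis=1)]

print(df)
```

Only the row containing 500 should be dropped; the remaining five rows survive.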


The initial dataset.

print(df.head())
   Col0  Col1  Col2  Col3  Col4  User_id
0    49    31    93    53    39       44
1    69    13    84    58    24       47
2    41    71     2    43    58       64
3    35    56    69    55    36       67
4    64    24    12    18    99       67

First, removing the User_id column.

filt_df = df.loc[:, df.columns != 'User_id']

Then, computing percentiles.

low = .05
high = .95
quant_df = filt_df.quantile([low, high])
print(quant_df)
       Col0   Col1  Col2   Col3   Col4
0.05   2.00   3.00   6.9   3.95   4.00
0.95  95.05  89.05  93.0  94.00  97.05

Next, filtering values based on the computed percentiles. To do that, I use a column-wise apply, and that's it! Values outside the bounds become NaN.

filt_df = filt_df.apply(lambda x: x[(x > quant_df.loc[low, x.name]) &
                                    (x < quant_df.loc[high, x.name])], axis=0)
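The same masking can also be written without `apply`, because comparing a DataFrame to a Series aligns on column labels. A minimal sketch (the stand-in values for `filt_df` here are made up for illustration):

```python
import pandas as pd

# Stand-in for filt_df with made-up values
filt_df = pd.DataFrame({'Col0': [2, 50, 99], 'Col1': [4, 40, 88]})
low, high = .05, .95
quant_df = filt_df.quantile([low, high])

# quant_df.loc[low] / quant_df.loc[high] are Series indexed by column name,
# so each column is masked against its own percentile bounds
masked = filt_df[(filt_df > quant_df.loc[low]) &
                 (filt_df < quant_df.loc[high])]
print(masked)
```

As with the `apply` version, values outside the per-column bounds are replaced by NaN rather than dropped.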

Bringing the User_id back.

filt_df = pd.concat([df.loc[:,'User_id'], filt_df], axis=1)

Last, rows with NaN values can be dropped simply like this.

filt_df.dropna(inplace=True)
print(filt_df.head())
   User_id  Col0  Col1  Col2  Col3  Col4
1       47    69    13    84    58    24
3       67    35    56    69    55    36
5        9    95    79    44    45    69
6       83    69    41    66    87     6
9       87    50    54    39    53    40

Checking the result (the output below still contains the NaN values, i.e. it was taken before the dropna step).

print(filt_df.head())
   User_id  Col0  Col1  Col2  Col3  Col4
0       44    49    31   NaN    53    39
1       47    69    13    84    58    24
2       64    41    71   NaN    43    58
3       67    35    56    69    55    36
4       67    64    24    12    18   NaN

print(filt_df.describe())
          User_id       Col0       Col1       Col2       Col3       Col4
count  100.000000  89.000000  88.000000  88.000000  89.000000  89.000000
mean    48.230000  49.573034  45.659091  52.727273  47.460674  57.157303
std     28.372292  25.672274  23.537149  26.509477  25.823728  26.231876
min      0.000000   3.000000   5.000000   7.000000   4.000000   5.000000
25%     23.000000  29.000000  29.000000  29.500000  24.000000  36.000000
50%     47.000000  50.000000  40.500000  52.500000  49.000000  59.000000
75%     74.250000  69.000000  67.000000  75.000000  70.000000  79.000000
max     99.000000  95.000000  89.000000  92.000000  91.000000  97.000000

How to generate the test dataset

import numpy as np
from pandas import DataFrame

np.random.seed(0)
nb_sample = 100
num_sample = (0, 100)

d = dict()
d['User_id'] = np.random.randint(num_sample[0], num_sample[1], nb_sample)
for i in range(5):
    d['Col' + str(i)] = np.random.randint(num_sample[0], num_sample[1], nb_sample)

df = DataFrame.from_dict(d)


What you are describing is similar to the process of winsorizing, which clips values (for example, at the 5th and 95th percentiles) instead of eliminating them completely.

Here's an example:

import pandas as pd
from scipy.stats import mstats
%matplotlib inline

test_data = pd.Series(range(30))
test_data.plot()

Original data

# Truncate values to the 5th and 95th percentiles
transformed_test_data = pd.Series(mstats.winsorize(test_data, limits=[0.05, 0.05]))
transformed_test_data.plot()

Winsorized data
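For a plot-free check of the same call: on the values 0 through 29, winsorizing 5% on each side should clip the extreme values inward rather than remove them, so the series length stays the same. A minimal sketch (the `np.arange(30)` input mirrors the `range(30)` series above):

```python
import numpy as np
from scipy.stats import mstats

test_data = np.arange(30)

# Clip (not drop) the bottom and top 5% of values;
# winsorize returns a masked array, so convert back to a plain ndarray
result = np.asarray(mstats.winsorize(test_data, limits=[0.05, 0.05]))

print(result.min(), result.max(), len(result))
```

The extremes 0 and 29 get replaced by their nearest surviving neighbors (1 and 28), and no rows are lost, which is the key difference from the percentile-filtering approaches above.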