Remove Outliers in Pandas DataFrame using Percentiles

Use this code and don't waste your time:

Q1 = df.quantile(0.25)Q3 = df.quantile(0.75)IQR = Q3 - Q1df = df[~((df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR))).any(axis=1)]

in case you want specific columns:

cols = ['col_1', 'col_2'] # one or moreQ1 = df[cols].quantile(0.25)Q3 = df[cols].quantile(0.75)IQR = Q3 - Q1df = df[~((df[cols] < (Q1 - 1.5 * IQR)) |(df[cols] > (Q3 + 1.5 * IQR))).any(axis=1)]

python pandas outliers

The initial dataset.

print(df.head())   Col0  Col1  Col2  Col3  Col4  User_id0    49    31    93    53    39       441    69    13    84    58    24       472    41    71     2    43    58       643    35    56    69    55    36       674    64    24    12    18    99       67

First removing the User_id column

filt_df = df.loc[:, df.columns != 'User_id']

Then, computing percentiles.

low = .05high = .95quant_df = filt_df.quantile([low, high])print(quant_df)       Col0   Col1  Col2   Col3   Col40.05   2.00   3.00   6.9   3.95   4.000.95  95.05  89.05  93.0  94.00  97.05

Next filtering values based on computed percentiles. To do that I use an apply by columns and that's it !

filt_df = filt_df.apply(lambda x: x[(x>quant_df.loc[low,x.name]) &                                     (x < quant_df.loc[high,x.name])], axis=0)

Bringing the User_id back.

filt_df = pd.concat([df.loc[:,'User_id'], filt_df], axis=1)

Last, rows with NaN values can be dropped simply like this.

filt_df.dropna(inplace=True)print(filt_df.head())   User_id  Col0  Col1  Col2  Col3  Col41       47    69    13    84    58    243       67    35    56    69    55    365        9    95    79    44    45    696       83    69    41    66    87     69       87    50    54    39    53    40

Checking result

print(filt_df.head())   User_id  Col0  Col1  Col2  Col3  Col40       44    49    31   NaN    53    391       47    69    13    84    58    242       64    41    71   NaN    43    583       67    35    56    69    55    364       67    64    24    12    18   NaNprint(filt_df.describe())          User_id       Col0       Col1       Col2       Col3       Col4count  100.000000  89.000000  88.000000  88.000000  89.000000  89.000000mean    48.230000  49.573034  45.659091  52.727273  47.460674  57.157303std     28.372292  25.672274  23.537149  26.509477  25.823728  26.231876min      0.000000   3.000000   5.000000   7.000000   4.000000   5.00000025%     23.000000  29.000000  29.000000  29.500000  24.000000  36.00000050%     47.000000  50.000000  40.500000  52.500000  49.000000  59.00000075%     74.250000  69.000000  67.000000  75.000000  70.000000  79.000000max     99.000000  95.000000  89.000000  92.000000  91.000000  97.000000

How to generate the test dataset

np.random.seed(0)nb_sample = 100num_sample = (0,100)d = dict()d['User_id'] = np.random.randint(num_sample[0], num_sample[1], nb_sample)for i in range(5):    d['Col' + str(i)] = np.random.randint(num_sample[0], num_sample[1], nb_sample)df = DataFrame.from_dict(d)

python pandas outliers

What you are describing is similar to the process of winsorizing, which clips values (for example, at the 5th and 95th percentiles) instead of eliminating them completely.

Here's an example:

import pandas as pdfrom scipy.stats import mstats%matplotlib inlinetest_data = pd.Series(range(30))test_data.plot()

# Truncate values to the 5th and 95th percentilestransformed_test_data = pd.Series(mstats.winsorize(test_data, limits=[0.05, 0.05])) transformed_test_data.plot()

CodeHunter

Remove Outliers in Pandas DataFrame using Percentiles

Checking result

How to generate the test dataset

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last