Detect and exclude outliers in Pandas data frame


If you have multiple columns in your dataframe and would like to remove all rows that have outliers in at least one column, the following expression would do that in one shot.

import numpy as np
import pandas as pd
from scipy import stats
df = pd.DataFrame(np.random.randn(100, 3))
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

Description:

  • For each column, it first computes the Z-score of each value in the column, relative to the column mean and standard deviation.
  • It then takes the absolute Z-score, because the direction does not matter, only whether the value is below the threshold.
  • all(axis=1) ensures that, for each row, all columns satisfy the constraint.
  • Finally, the result of this condition is used to index the dataframe.
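
The steps above can be spelled out one at a time. The following is a minimal sketch on a small made-up dataset; the column names a and b and the planted value 100 are assumptions for illustration, not part of the original answer. Note that scipy.stats.zscore uses the population standard deviation (ddof=0) by default, and that a constant column would give NaN Z-scores and drop every row.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: 20 rows, with one planted extreme value in column "b".
df = pd.DataFrame({
    "a": np.arange(20.0),
    "b": np.tile([1.0, 2.0], 10),
})
df.loc[5, "b"] = 100.0

# Step 1: Z-score of every value, computed column by column (axis=0 is the default).
z = stats.zscore(df)

# Step 2: the sign does not matter, so take the absolute value and compare to the threshold.
within = np.abs(z) < 3

# Step 3: a row is kept only if all of its columns pass.
keep = np.asarray(within).all(axis=1)

# Step 4: use the boolean mask to index the dataframe.
filtered = df[keep]
print(filtered.shape)
```

Only the row holding the planted value fails the test here; every other row passes in both columns.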

Filter other columns based on a single column

  • Pass a single column to zscore, df[0] for example, and remove .all(axis=1):
df[(np.abs(stats.zscore(df[0])) < 3)]
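
As a rough illustration of the single-column variant, the sketch below filters on one named column and leaves the others unchecked. The column names price and qty and all values are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: "price" has an extreme value in the last row,
# "qty" has an extreme value in the first row.
df = pd.DataFrame({
    "price": list(np.arange(10.0, 29.0)) + [1000.0],
    "qty": [999.0] + [1.0, 2.0] * 9 + [1.0],
})

# Only "price" is tested; a 1-D mask needs no .all(axis=1).
mask = np.abs(stats.zscore(df["price"])) < 3
filtered = df[np.asarray(mask)]
print(filtered.tail())
```

The last row is dropped because of its price, while the first row stays even though its qty is extreme, since qty is never examined.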


Use boolean indexing, as you would with a numpy.array:

df = pd.DataFrame({'Data': np.random.normal(size=200)})  # example dataset of normally distributed data
# keep only the rows that are within +3 to -3 standard deviations in the column 'Data'
df[np.abs(df.Data - df.Data.mean()) <= (3 * df.Data.std())]
# or, if you prefer the other way around
df[~(np.abs(df.Data - df.Data.mean()) > (3 * df.Data.std()))]

For a series it is similar:

S = pd.Series(np.random.normal(size=200))
S[~((S - S.mean()).abs() > 3 * S.std())]
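
The same mean and standard deviation rule can be applied to several columns at once without scipy. This is a minimal sketch, assuming all columns are numeric; the helper name within_n_std and the sample data are hypothetical. Keep in mind that pandas computes the sample standard deviation (ddof=1), so results can differ slightly from scipy.stats.zscore, which defaults to ddof=0.

```python
import numpy as np
import pandas as pd

def within_n_std(frame, n=3.0):
    """Boolean Series: True where every column lies within n sample std devs of its mean."""
    deviations = (frame - frame.mean()).abs()
    # Compare each column against its own threshold, then require all columns to pass.
    return deviations.le(n * frame.std(), axis="columns").all(axis=1)

# Hypothetical data: 30 rows, with one planted extreme value in column "y".
df = pd.DataFrame({
    "x": np.arange(30.0),
    "y": np.tile([1.0, 2.0], 15),
})
df.loc[7, "y"] = 200.0

filtered = df[within_n_std(df)]
print(filtered.shape)
```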


For each column of your dataframe, you can get a quantile with:

q = df["col"].quantile(0.99)

and then filter with:

df[df["col"] < q]

To remove both lower and upper outliers, combine the conditions with an AND (&):

q_low = df["col"].quantile(0.01)
q_hi  = df["col"].quantile(0.99)
df_filtered = df[(df["col"] < q_hi) & (df["col"] > q_low)]
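
The two-sided quantile filter can be wrapped in a small helper. The sketch below is one possible way to do it; the function name trim_by_quantile and the sample data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def trim_by_quantile(frame, col, low=0.01, high=0.99):
    """Keep rows whose value in `col` lies strictly between the two quantiles."""
    q_low = frame[col].quantile(low)
    q_hi = frame[col].quantile(high)
    return frame[(frame[col] > q_low) & (frame[col] < q_hi)]

# Hypothetical data: the values 1.0 through 100.0, none of them really extreme.
df = pd.DataFrame({"col": np.arange(1.0, 101.0)})
df_filtered = trim_by_quantile(df, "col")
print(len(df_filtered))
```

Unlike the Z-score rule, a quantile cut always discards roughly the same fraction of rows (about 2% with these thresholds), whether or not those rows are actually extreme. In this sample the smallest and largest values are removed even though the data contains no real outliers.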