Filtering out outliers in Pandas dataframe with rolling median


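Just filter the dataframe:

    df['median'] = df['b'].rolling(window).median()
    df['std'] = df['b'].rolling(window).std()

    # filter setup: keep rows within 3 rolling standard deviations of the rolling median
    # (the first window - 1 rows have NaN stats, so they are dropped too)
    df = df[(df.b <= df['median'] + 3 * df['std']) & (df.b >= df['median'] - 3 * df['std'])]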
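To see what this filter does on concrete numbers, here is a minimal sketch. The data (an alternating 10/11 baseline with one spike at row 15) and the window size of 12 are assumptions made up for illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical data: alternating 10/11 baseline with a single spike at row 15.
values = [10 if i % 2 == 0 else 11 for i in range(30)]
values[15] = 100
df = pd.DataFrame({"b": values})

window = 12  # assumed window size, chosen only for this example

rolling = df["b"].rolling(window)
median = rolling.median()
std = rolling.std()

# Comparisons against NaN are False, so rows without a full window are dropped.
keep = (df["b"] <= median + 3 * std) & (df["b"] >= median - 3 * std)
filtered = df[keep]

print(15 in filtered.index)   # is the spike still present?
print(filtered.index.min())   # first row that survives the filter
```

Two things are worth checking on your own data. The first `window - 1` rows disappear along with the outliers, because their rolling statistics are NaN. Also, the spike is part of its own window and inflates that window's std, so with a short window a 3-std band may be too wide to catch it; a smaller multiplier or a longer window may be needed.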


There might well be a more pandastic way to do this. This is a bit of a hack: it manually maps the original df's index to each rolling window (I picked a window size of 6). The rows up to and including row 6 belong to the first window, row 7 belongs to the second window, and so on.

    import numpy as np
    import pandas as pd

    n = 100
    df = pd.DataFrame(np.random.randint(0, n, size=(n, 2)), columns=['a', 'b'])

    ## set window size
    window = 6
    std = 1  # I set it at just 1; with real data and larger windows, can be larger

    ## create df with rolling stats, upper and lower bounds
    bounds = pd.DataFrame({'median': df['b'].rolling(window).median(),
                           'std': df['b'].rolling(window).std()})
    bounds['upper'] = bounds['median'] + bounds['std'] * std
    bounds['lower'] = bounds['median'] - bounds['std'] * std

    ## here, we set an identifier for each window which maps to the original df
    ## the first six rows are the first window; then each additional row is a new window
    bounds['window_id'] = np.append(np.zeros(window), np.arange(1, n - window + 1))

    ## then we can assign the original 'b' value back to the bounds df
    bounds['b'] = df['b']

    ## and finally, keep only rows where b falls within the desired bounds
    bounds.loc[bounds.eval("lower<b<upper")]
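The window-to-row mapping described above is easier to follow on a tiny example. The sketch below uses made-up sizes (`n = 10`, `window = 6`) and a dummy series; it only illustrates how the ids line up with the rows that have a full rolling window.

```python
import numpy as np
import pandas as pd

n, window = 10, 6  # small, hypothetical sizes so the mapping is easy to read

s = pd.Series(np.arange(n, dtype=float))
rolled = s.rolling(window).median()

# Rows 0..window-2 have no full window yet, so their rolling stats are NaN.
first_full = rolled.first_valid_index()

# Same construction as in the answer: the first `window` rows share id 0,
# then every additional row gets its own id.
window_id = np.append(np.zeros(window), np.arange(1, n - window + 1)).astype(int)

print(first_full)          # 5
print(window_id.tolist())  # [0, 0, 0, 0, 0, 0, 1, 2, 3, 4]
```

Only the last row of each window (row index 5 for id 0, then 6, 7, ...) actually carries that window's median and std; the earlier rows with id 0 have NaN bounds and fall out of the final selection.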


This is my take on creating a median filter:

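    import numpy as np

    def median_filter(num_std=3):
        def _median_filter(x):
            _median = np.median(x)
            _std = np.std(x)  # note: np.std defaults to ddof=0; pandas' rolling std uses ddof=1
            s = x[-1]         # the newest value in the window
            # keep the value if it lies within num_std standard deviations of the median
            return s if s >= _median - num_std * _std and s <= _median + num_std * _std else np.nan
        return _median_filter

    df.y.rolling(window).apply(median_filter(num_std=3), raw=True)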
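As a rough usage sketch, here is the same idea applied to a small made-up series. The data, the column name `y`, and the window size of 12 are assumptions for illustration. Unlike a boolean row filter, this approach returns a Series of the same length with NaN where a value was rejected (or where the window is not yet full).

```python
import numpy as np
import pandas as pd

def median_filter(num_std=3):
    def _median_filter(x):
        # x is a NumPy array because rolling.apply is called with raw=True
        median = np.median(x)
        std = np.std(x)  # population std (ddof=0)
        last = x[-1]     # the newest value in the window
        inside = median - num_std * std <= last <= median + num_std * std
        return last if inside else np.nan
    return _median_filter

# Hypothetical data: alternating 10/11 baseline with a single spike at row 15.
values = [10.0 if i % 2 == 0 else 11.0 for i in range(30)]
values[15] = 100.0
df = pd.DataFrame({"y": values})

window = 12  # assumed window size
result = df["y"].rolling(window).apply(median_filter(num_std=3), raw=True)

# Rows that are NaN were either outliers or had an incomplete window.
cleaned = df[result.notna()]
print(result.isna().sum())
print(15 in cleaned.index)
```

If the rejected values should be filled rather than dropped, `result` can be passed through something like `interpolate()` instead of being used as a mask.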