Fill in missing pandas data with previous non-missing value, grouped by key


You could perform a groupby/forward-fill operation on each group:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'id': [1,1,2,2,1,2,1,1],
                       'x': [10,20,100,200,np.nan,np.nan,300,np.nan]})
    df['x'] = df.groupby(['id'])['x'].ffill()
    print(df)

yields

       id      x
    0   1   10.0
    1   1   20.0
    2   2  100.0
    3   2  200.0
    4   1   20.0
    5   2  200.0
    6   1  300.0
    7   1  300.0
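One detail the example above does not exercise: a missing value at the start of a group has no previous value to copy, so it stays NaN. The sketch below illustrates this with made-up data; the frame `demo` and the column `x_filled` are hypothetical names, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first row for id 2 is missing.
demo = pd.DataFrame({'id': [1, 2, 1, 2, 2],
                     'x': [10.0, np.nan, np.nan, 50.0, np.nan]})

# Forward-fill within each id; values never cross from one id to another.
demo['x_filled'] = demo.groupby('id')['x'].ffill()
print(demo)
```

A plain `demo['x'].ffill()` would instead copy the 10.0 from id 1 into the first row of id 2, which is what the grouping prevents.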


    df
       id   val
    0   1  23.0
    1   1   NaN
    2   1   NaN
    3   2   NaN
    4   2  34.0
    5   2   NaN
    6   3   2.0
    7   3   NaN
    8   3   NaN

    df.sort_values(['id','val']).groupby('id').ffill()
       id   val
    0   1  23.0
    1   1  23.0
    2   1  23.0
    4   2  34.0
    3   2  34.0
    5   2  34.0
    6   3   2.0
    7   3   2.0
    8   3   2.0

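Use sort_values, groupby, and ffill together so that NaN values at the start of a group also get filled. Sorting by val places NaN last within each id, so the forward fill reaches every missing row. Note that each NaN then takes the largest non-missing value in its group, not the value from the row that originally preceded it.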
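If the original row order matters, one alternative is to forward-fill and then back-fill inside each group without sorting. This is a minimal sketch of that idea, not the answer's own method; `df2` and `val_filled` are hypothetical names, and the data mirrors the example above.

```python
import numpy as np
import pandas as pd

# Same values as the example above, rebuilt here so the block is self-contained.
df2 = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'val': [23.0, np.nan, np.nan, np.nan, 34.0, np.nan, 2.0, np.nan, np.nan],
})

# Within each id: forward-fill first, then back-fill any leading NaN.
df2['val_filled'] = df2.groupby('id')['val'].transform(lambda s: s.ffill().bfill())
print(df2)
```

Here the leading NaN for id 2 is filled from the next non-missing row in the same group, and the rows keep their original positions.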


Solution for multi-key problem:

In this example, the data has the key [date, region, type]. Date is the index on the original dataframe.

    import os
    import pandas as pd

    # sort to make indexing faster
    df.sort_values(by=['date','region','type'], inplace=True)

    # collect all possible regions and types
    regions = list(set(df['region']))
    types = list(set(df['type']))

    # record column names
    df_cols = df.columns

    # delete ffill_df.csv so we can begin anew
    try:
        os.remove('ffill_df.csv')
    except FileNotFoundError:
        pass

    # steps:
    # 1) grab rows with a particular region and type
    # 2) use forwardfill to fill nulls
    # 3) use backwardfill to fill remaining nulls
    # 4) append to file
    for r in regions:
        for t in types:
            group_df = df[(df.region == r) & (df.type == t)].copy()
            # ffill()/bfill() replace the deprecated fillna(method=...)
            group_df.ffill(inplace=True)
            group_df.bfill(inplace=True)
            group_df.to_csv('ffill_df.csv', mode='a', header=False, index=True)

Checking the result:

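    # load in the ffill_df
    ffill_df = pd.read_csv('ffill_df.csv', header=None, index_col=None)
    # the file has no header; this assumes its first column is the date index
    # written by to_csv, followed by the original columns recorded in df_cols
    ffill_df.columns = ['date'] + list(df_cols)
    ffill_df.index = ffill_df.date
    ffill_df.drop('date', axis=1, inplace=True)
    ffill_df.head()

    # compare new and old dataframe
    print(df.shape)
    print(ffill_df.shape)
    print()
    print(pd.isnull(ffill_df).sum())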
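For comparison, the same per-group forward-fill and back-fill can probably be done in memory by grouping on both key columns at once, which avoids the nested loops and the intermediate CSV. The sketch below is an assumption-laden illustration: the frame `sample`, its values, and the column name `value` are made up, with date as the index as in the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical data keyed by [date, region, type]; 'value' is an assumed column name.
sample = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-01-01', '2020-01-02',
                            '2020-01-02', '2020-01-03', '2020-01-03']),
    'region': ['east', 'west', 'east', 'west', 'east', 'west'],
    'type': ['a', 'a', 'a', 'a', 'a', 'a'],
    'value': [np.nan, 5.0, 1.0, np.nan, np.nan, 7.0],
}).set_index('date')

# Sort by date so "previous" means the earlier date within each group.
sample = sample.sort_index(kind='mergesort')

# Forward-fill, then back-fill, within each (region, type) group.
sample['value'] = (
    sample.groupby(['region', 'type'])['value']
          .transform(lambda s: s.ffill().bfill())
)

print(sample)
print(sample['value'].isna().sum())
```

Because `transform` returns a result aligned to the original rows, the filled column can be assigned straight back and no reload step is needed.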