Fill in missing pandas data with previous non-missing value, grouped by key
You could perform a groupby/forward-fill operation on each group:
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'id': [1,1,2,2,1,2,1,1], 'x':[10,20,100,200,np.nan,np.nan,300,np.nan]})
    df['x'] = df.groupby(['id'])['x'].ffill()
    print(df)
yields
       id      x
    0   1   10.0
    1   1   20.0
    2   2  100.0
    3   2  200.0
    4   1   20.0
    5   2  200.0
    6   1  300.0
    7   1  300.0
    df
       id   val
    0   1  23.0
    1   1   NaN
    2   1   NaN
    3   2   NaN
    4   2  34.0
    5   2   NaN
    6   3   2.0
    7   3   NaN
    8   3   NaN

    df.sort_values(['id','val']).groupby('id').ffill()
       id   val
    0   1  23.0
    1   1  23.0
    2   1  23.0
    4   2  34.0
    3   2  34.0
    5   2  34.0
    6   3   2.0
    7   3   2.0
    8   3   2.0
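Use sort_values, groupby and ffill together so that a NaN in the first row (or the first several rows) of a group also gets filled. Sorting by val places the NaN rows at the end of each group, so every NaN follows a non-missing value. Note that, because of the sort, each NaN takes the largest non-missing value in its group rather than the value that preceded it in the original row order.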
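If the original row order should be preserved, one alternative is to forward-fill and then backward-fill within each group, which also covers leading NaN values without sorting. This is a minimal sketch on the same sample data as above, not part of the answer itself; the `filled` name is only for illustration.

```python
import numpy as np
import pandas as pd

# same sample data as the example above
df = pd.DataFrame({
    'id':  [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'val': [23, np.nan, np.nan, np.nan, 34, np.nan, 2, np.nan, np.nan],
})

filled = df.copy()
# forward-fill inside each group; leading NaNs in a group stay NaN
filled['val'] = filled.groupby('id')['val'].ffill()
# backward-fill inside each group to cover those leading NaNs
filled['val'] = filled.groupby('id')['val'].bfill()
print(filled)
```

A group whose values are all missing stays NaN with this approach, since there is nothing to fill from.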
Solution for multi-key problem:
In this example, the data has the key [date, region, type]. Date is the index on the original dataframe.
    import os
    import pandas as pd

    # sort to make indexing faster
    df.sort_values(by=['date','region','type'], inplace=True)

    # collect all possible regions and types
    regions = list(set(df['region']))
    types = list(set(df['type']))

    # record column names
    df_cols = df.columns

    # delete ffill_df.csv so we can begin anew
    try:
        os.remove('ffill_df.csv')
    except FileNotFoundError:
        pass

    # steps:
    # 1) grab rows with a particular region and type
    # 2) use forwardfill to fill nulls
    # 3) use backwardfill to fill remaining nulls
    # 4) append to file
    for r in regions:
        for t in types:
            group_df = df[(df.region == r) & (df.type == t)].copy()
            group_df.ffill(inplace=True)
            group_df.bfill(inplace=True)
            group_df.to_csv('ffill_df.csv', mode='a', header=False, index=True)
Checking the result:
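    # load in the ffill_df
    ffill_df = pd.read_csv('ffill_df.csv', header=None, index_col=None)
    # the file was written with index=True, so the date index is its first column
    # (assumed definition; the original snippet used this name without defining it)
    df_reindexed_cols = ['date'] + list(df_cols)
    ffill_df.columns = df_reindexed_cols
    ffill_df.index = ffill_df.date
    ffill_df.drop('date', axis=1, inplace=True)
    ffill_df.head()

    # compare new and old dataframe
    print(df.shape)
    print(ffill_df.shape)
    print()
    print(pd.isnull(ffill_df).sum())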
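For data that fits in memory, the same forward-then-backward fill could probably be done with a single groupby over both keys, without the intermediate CSV file. The sketch below uses made-up sample data: the `value` column, the region and type labels, and the names `flat` and `filled` are assumptions for illustration, not the data from the answer above.

```python
import numpy as np
import pandas as pd

# made-up sample data with 'date' as the index, as described above
dates = pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03'] * 2)
df = pd.DataFrame({
    'region': ['east'] * 3 + ['west'] * 3,
    'type':   ['a'] * 6,
    'value':  [np.nan, 1.0, np.nan, 5.0, np.nan, np.nan],
}, index=pd.Index(dates, name='date'))

keys = ['region', 'type']
value_cols = ['value']  # hypothetical list of columns to fill

# move 'date' into a column so the row index is unique, then order by key and date
flat = df.reset_index().sort_values(keys + ['date'])

# forward-fill, then backward-fill, within each (region, type) group
flat[value_cols] = flat.groupby(keys)[value_cols].ffill()
flat[value_cols] = flat.groupby(keys)[value_cols].bfill()

# restore 'date' as the index
filled = flat.set_index('date')
print(filled)
print(filled.isnull().sum())
```

Resetting the index first keeps the row labels unique while the filled columns are assigned back, which matters here because the same date appears once per region and type.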