Pandas: filling missing values by mean in each group
One way would be to use transform
:
>>> df name value0 A 11 A NaN2 B NaN3 B 24 B 35 B 16 C 37 C NaN8 C 3>>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))>>> df name value0 A 11 A 12 B 23 B 24 B 35 B 16 C 37 C 38 C 3
fillna
+ groupby
+ transform
+ mean
This seems intuitive:
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
The groupby
+ transform
syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to @DSM's solution, but avoids the need to define an anonymous lambda
function.
@DSM has IMO the right answer, but I'd like to share my generalization and optimization of the question: Multiple columns to group-by and having multiple value columns:
df = pd.DataFrame( { 'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'], 'name': ['A','A', 'B','B','B','B', 'C','C','C'], 'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30], 'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], })
... gives ...
category name other_value value0 X A 10.0 1.01 X A NaN NaN2 X B NaN NaN3 X B 20.0 2.04 X B 30.0 3.05 X B 10.0 1.06 Y C 30.0 3.07 Y C NaN NaN8 Y C 30.0 3.0
In this generalized case we would like to group by category
and name
, and impute only on value
.
This can be solved as follows:
df['value'] = df.groupby(['category', 'name'])['value']\ .transform(lambda x: x.fillna(x.mean()))
Notice the column list in the group-by clause, and that we select the value
column right after the group-by. This makes the transformation only be run on that particular column. You could add it to the end, but then you will run it for all columns only to throw out all but one measure column at the end. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to do this.
Performance test by increasing the dataset by doing ...
big_df = Nonefor _ in range(10000): if big_df is None: big_df = df.copy() else: big_df = pd.concat([big_df, df])df = big_df
... confirms that this increases the speed proportional to how many columns you don't have to impute:
import pandas as pdfrom datetime import datetimedef generate_data(): ...t = datetime.now()df = generate_data()df['value'] = df.groupby(['category', 'name'])['value']\ .transform(lambda x: x.fillna(x.mean()))print(datetime.now()-t)# 0:00:00.016012t = datetime.now()df = generate_data()df["value"] = df.groupby(['category', 'name'])\ .transform(lambda x: x.fillna(x.mean()))['value']print(datetime.now()-t)# 0:00:00.030022
On a final note you can generalize even further if you want to impute more than one column, but not all:
df[['value', 'other_value']] = df.groupby(['category', 'name'])['value', 'other_value']\ .transform(lambda x: x.fillna(x.mean()))