Speeding up calculation of nearby groups?
It's clear that the bottleneck is indexing the main dataframe with the `isin` method: as the dataframe grows in length, a much larger search has to be done. I propose doing that same search on the much smaller df_groups dataframe instead, and calculating an updated average from the per-group means and counts.
```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

df = pd.DataFrame(np.random.randint(0, 100, size=(100000, 4)),
                  columns=['Group', 'Dist1', 'Dist2', 'Value'])

distances = [1, 2]

# get the mean of all values and the count (the totals) for each group
df_groups = df.groupby('Group')[['Dist1', 'Dist2', 'Value']].agg(
    {'Dist1': 'mean', 'Dist2': 'mean', 'Value': ['mean', 'count']})

# flatten the multi-level column index
df_groups.columns = [' '.join(col).strip() for col in df_groups.columns.values]

# rename columns
df_groups.rename(columns={'Dist1 mean': 'Dist1', 'Dist2 mean': 'Dist2',
                          'Value mean': 'Value', 'Value count': 'Count'},
                 inplace=True)

# create a KDTree on the group centroids for quick searching
tree = cKDTree(df_groups[['Dist1', 'Dist2']])

for i in distances:
    closeby = tree.query_ball_tree(tree, r=i)
    # put the neighbour counts into a density column
    df_groups['groups_within_' + str(i) + 'miles'] = [len(x) for x in closeby]
    # create a column holding each group's set of nearby groups
    df_groups['subs'] = [df_groups.index.values[idx] for idx in closeby]
    # precompute mean * count to prep the updated-mean calculation
    df_groups['ComMean'] = df_groups['Value'] * df_groups['Count']
    # perform the updated mean over each group's neighbours
    df_groups[str(i) + '_mean_values'] = [
        df_groups.loc[df_groups.index.isin(row), 'ComMean'].sum() /
        df_groups.loc[df_groups.index.isin(row), 'Count'].sum()
        for row in df_groups['subs']]
    df = pd.merge(df,
                  df_groups[['groups_within_' + str(i) + 'miles',
                             str(i) + '_mean_values']],
                  left_on='Group', right_index=True)
```
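For reference, after the loop df ends up with one density column and one updated-mean column per radius (the names follow the string formatting in the loop above):

```python
print(df.columns.tolist())
# ['Group', 'Dist1', 'Dist2', 'Value',
#  'groups_within_1miles', '1_mean_values',
#  'groups_within_2miles', '2_mean_values']
```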
The formula for an updated mean is just (m1*n1 + m2*n2) / (n1 + n2), where each m is a group's mean and each n its count; it generalizes to any number of groups as sum(m_i*n_i) / sum(n_i), which is exactly what the ComMean sums compute.
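A quick sanity check of that identity (a standalone sketch, not part of the benchmark): the pooled mean of two samples matches the mean of their concatenation.

```python
import numpy as np

a = np.array([1, 2, 3])   # m1 = 2.0, n1 = 3
b = np.array([10, 20])    # m2 = 15.0, n2 = 2

pooled = (a.mean() * len(a) + b.mean() * len(b)) / (len(a) + len(b))

# pooled mean equals the mean of the combined sample: 36 / 5 = 7.2
assert pooled == np.concatenate([a, b]).mean()
```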
Old setup (the original isin-based approach):

```
100000 rows
%timeit old(df)
1 loop, best of 3: 694 ms per loop

1000000 rows
%timeit old(df)
1 loop, best of 3: 6.08 s per loop

10000000 rows
%timeit old(df)
1 loop, best of 3: 6min 13s per loop
```
New setup (the approach above):

```
100000 rows
%timeit new(df)
10 loops, best of 3: 136 ms per loop

1000000 rows
%timeit new(df)
1 loop, best of 3: 525 ms per loop

10000000 rows
%timeit new(df)
1 loop, best of 3: 4.53 s per loop
```