Speeding up calculation of nearby groups?
It's clear that the bottleneck is indexing the main dataframe with the `isin` method: as the dataframe grows in length, a much larger search has to be done. I propose doing that same search on the much smaller df_groups dataframe instead, and calculating an updated average from the per-group means and counts.
```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

df = pd.DataFrame(np.random.randint(0, 100, size=(100000, 4)),
                  columns=['Group', 'Dist1', 'Dist2', 'Value'])

distances = [1, 2]

# get the mean of all values and the count (the totals) for each group
df_groups = df.groupby('Group')[['Dist1', 'Dist2', 'Value']].agg(
    {'Dist1': 'mean', 'Dist2': 'mean', 'Value': ['mean', 'count']})

# flatten the multi-level column index
df_groups.columns = [' '.join(col).strip() for col in df_groups.columns.values]

# rename columns
df_groups.rename(columns={'Dist1 mean': 'Dist1', 'Dist2 mean': 'Dist2',
                          'Value mean': 'Value', 'Value count': 'Count'},
                 inplace=True)

# create a KDTree on the group centroids for quick searching
tree = cKDTree(df_groups[['Dist1', 'Dist2']])

for i in distances:
    closeby = tree.query_ball_tree(tree, r=i)
    # put the neighbour counts into a density column
    df_groups['groups_within_' + str(i) + 'miles'] = [len(x) for x in closeby]
    # create a column holding each group's set of nearby groups
    df_groups['subs'] = [df_groups.index.values[idx] for idx in closeby]
    # precompute mean * count to prep the updated-mean calculation
    df_groups['ComMean'] = df_groups['Value'] * df_groups['Count']
    # perform the updated mean over each group's neighbours
    df_groups[str(i) + '_mean_values'] = [
        df_groups.loc[df_groups.index.isin(row), 'ComMean'].sum() /
        df_groups.loc[df_groups.index.isin(row), 'Count'].sum()
        for row in df_groups['subs']]
    df = pd.merge(df,
                  df_groups[['groups_within_' + str(i) + 'miles',
                             str(i) + '_mean_values']],
                  left_on='Group', right_index=True)
```
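For reference, after the loop df ends up with one density column and one updated-mean column per radius (the names follow the string formatting in the loop above):

```python
print(df.columns.tolist())
# ['Group', 'Dist1', 'Dist2', 'Value',
#  'groups_within_1miles', '1_mean_values',
#  'groups_within_2miles', '2_mean_values']
```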
The formula for an updated mean is just (m1*n1 + m2*n2) / (n1 + n2), where each m is a group's mean and each n its count; it generalizes to any number of groups as sum(m_i*n_i) / sum(n_i), which is exactly what the ComMean sums compute.
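A quick sanity check of that identity (a standalone sketch, not part of the benchmark): the pooled mean of two samples matches the mean of their concatenation.

```python
import numpy as np

a = np.array([1, 2, 3])   # m1 = 2.0, n1 = 3
b = np.array([10, 20])    # m2 = 15.0, n2 = 2

pooled = (a.mean() * len(a) + b.mean() * len(b)) / (len(a) + len(b))

# pooled mean equals the mean of the combined sample: 36 / 5 = 7.2
assert pooled == np.concatenate([a, b]).mean()
```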
Old setup (the original isin-based approach):

```
100000 rows
%timeit old(df)
1 loop, best of 3: 694 ms per loop

1000000 rows
%timeit old(df)
1 loop, best of 3: 6.08 s per loop

10000000 rows
%timeit old(df)
1 loop, best of 3: 6min 13s per loop
```
New setup (the approach above):

```
100000 rows
%timeit new(df)
10 loops, best of 3: 136 ms per loop

1000000 rows
%timeit new(df)
1 loop, best of 3: 525 ms per loop

10000000 rows
%timeit new(df)
1 loop, best of 3: 4.53 s per loop
```