
Pandas DataFrame groupby describe ~8x slower than computing separately


Educated guess

I will post this as an answer, maybe to be deleted later, since it is more of an educated guess than an actual answer. Also, it is a bit too long for a comment.

So the first thing I did after reading your answer was to re-run your timings in a profiler to take a closer look at the issue. Since the computation itself was rather short, it got overshadowed quite a bit by the data generation. In general, though, the times were similar to what you described. Not only that, the difference got even more pronounced:
1094 ms for the first approach vs 63 ms for the second one, a factor of ~17.

Since the lower of the two times was rather small, I decided it was too small to be trusted and re-ran the test with a 10x larger generated data sample. That pushed the data generation step up to about a minute, and the numbers got weird: 1173 ms for the first approach vs 506 ms for the second one, a factor only slightly worse than two.

I started to suspect something. To confirm my suspicion, I ran one last test, once again increasing the data size by a factor of 10. The result might surprise you:
12258 ms for the first approach vs 3646 ms for the second one. The tables have turned completely, with the factor now at ~0.3.

My guess is that the pandas computation is actually the one with the better optimization/algorithm. However, since it is pandas, it comes with quite a bit of additional baggage - the price paid for convenience and robustness. This layer of "unnecessary" (computation-wise) baggage has a roughly fixed cost that needs to be carried around no matter how large the data set is, which is why it dominates the timings on small data and fades away on large data.

So if you want to be faster than pandas even on data sets of your size, take their operations and write them yourself, in the most straightforward manner possible. That keeps their optimization and discards the baggage paid for convenience.
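A quick way to check this fixed-overhead theory is to time both approaches at a few data sizes and watch the ratio flip. This is a minimal sketch, not the benchmark from the question: the column names, group count, and sizes below are invented for illustration.

```python
import time

import numpy as np
import pandas as pd


def hand_rolled(df):
    # Straightforward per-group stats, calling the reductions directly.
    g = df.groupby("group")["value"]
    return pd.concat({"count": g.count(), "mean": g.mean(), "std": g.std()}, axis=1)


for n in (10_000, 100_000, 1_000_000):
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "group": rng.integers(0, 100, n),
        "value": rng.normal(size=n),
    })

    t0 = time.perf_counter()
    df.groupby("group")["value"].describe()
    t1 = time.perf_counter()
    hand_rolled(df)
    t2 = time.perf_counter()

    print(f"n={n:>9,}  describe: {t1 - t0:.3f}s  hand-rolled: {t2 - t1:.3f}s")
```

If the guess above is right, the describe() column should shrink relative to the hand-rolled one as n grows, since its extra cost is roughly constant.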


Note: this answer is for pandas version 1.0.5. Things might be different for other versions.

tl;dr

pandas' describe() method will always be slower than your version, because under the hood it uses almost exactly the same logic, plus some extra work such as making sure the data has the right dimensions, ordering the results, and checking for NaNs and the correct data types.
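For a numeric column or group, that shared core logic boils down to roughly the following. This is a paraphrased sketch of what pandas 1.0.x does internally, not the exact source; the function name and the percentile-label formatting here are simplified.

```python
import pandas as pd


def describe_numeric_1d_sketch(series: pd.Series, percentiles) -> pd.Series:
    # The same reductions a hand-written version would call directly:
    # non-NaN count, mean, std, min, the requested quantiles, and max.
    stats = (
        [series.count(), series.mean(), series.std(), series.min()]
        + series.quantile(percentiles).tolist()
        + [series.max()]
    )
    labels = ["count", "mean", "std", "min"] + [f"{p:.0%}" for p in percentiles] + ["max"]
    return pd.Series(stats, index=labels, name=series.name)
```

Everything else describe() does on top of this is validation, assembly, and presentation.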


Longer Answer

Taking a look at the source code of the describe method, we can see a few things:

  • pandas uses the same logic as your code to compute the statistics (see this line inside the describe() method for an example). This means that pandas describe will always be slower.
  • pandas counts non-NaN values using s.count(), while your code counts all values with len(). Let's modify your code to use that same method instead of len():
def summarize_numeric_3(df, filt, groups, values, quantiles):
    # Same as the original version, except the count: use pd.Series.count
    # (non-NaN values only, as describe() does) instead of len().
    dfg = df[filt].groupby(groups)[values]
    dfg_stats = dfg.agg([np.mean, np.std, pd.Series.count]).stack()
    dfg_quantiles = dfg.quantile(quantiles)
    return dfg_stats.append(dfg_quantiles).sort_index()

%timeit -n 10 summarize_numeric_3(df, mask, groups, values, all_quantiles)
# outputs
# 48.9 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

It takes ~49 ms, vs 42 ms for your version on my machine. That relatively small modification alone costs about 7 ms!

  • pandas does a lot more than your code to ensure the data has the correct type and shape, and to present it in a pretty format. I've extracted the pandas describe method code into a "self-contained"* version that you can profile and tinker with here (too long to be posted in this answer). Profiling that, I saw that a very large portion of the time is spent "setting a convenient order for rows". Removing that ordering improved the describe timing by ~8%, from 530 ms down to 489 ms.
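If you want to reproduce that kind of profiling on your own machine, a minimal sketch is below; the example DataFrame and column names are invented, so point the groupby at your own data instead.

```python
import cProfile
import pstats

import numpy as np
import pandas as pd

# Invented example data; substitute your own DataFrame and group/value columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": rng.integers(0, 100, 1_000_000),
    "value": rng.normal(size=1_000_000),
})

profiler = cProfile.Profile()
profiler.enable()
df.groupby("group")["value"].describe()
profiler.disable()

# Sort by cumulative time to see which internal steps of describe() dominate.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```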