
Pandas: df.groupby() is too slow for big data set. Any alternative methods?


The problem is that your data are not numeric. Processing strings takes a lot longer than processing numbers. Try this first:

df.index = df.index.astype(int)
df.qty_liter = df.qty_liter.astype(float)

Then do groupby() again. It should be much faster. If it is, see if you can modify your data loading step to have the proper dtypes from the beginning.
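For example, if the data comes from a CSV file, the dtypes can be specified at load time instead of being converted afterwards. This is only a minimal sketch; the file name and the read_csv call are assumptions about how your data is actually loaded:

import pandas as pd

# Hypothetical loading step: the dtype mapping makes the group key and the
# aggregated column numeric from the start
df = pd.read_csv(
    "data.csv",
    dtype={"index": "int64", "qty_liter": "float64"},
)
df = df.set_index("index")

# Grouping by the (now integer) index should be much faster
grouped = df.groupby(level=0)["qty_liter"].sum()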


Your data has a very large number of distinct groups (a high-cardinality group key), which is the main reason the groupby is so slow. I tried Bodo to see how it would handle the groupby on a large data set. I ran the code with regular sequential Pandas and with parallelized Bodo: it took about 20 seconds with Pandas and only 5 seconds with Bodo. Bodo basically parallelizes your Pandas code automatically and lets you run it on multiple processors, which you cannot do with native Pandas. It is free for up to four cores: https://docs.bodo.ai/latest/source/install.html

Notes on data generation: I generated a relatively large dataset with 20 million rows and 18 numerical columns. To make it more similar to your dataset, two additional columns named "index" and "qty_liter" are included.

# data generation
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(20000000, 18), columns=list('ABCDEFGHIJKLMNOPQR'))
df['index'] = np.random.randint(2147400000, 2147500000, 20000000).astype(str)
df['qty_liter'] = np.random.randn(20000000)
df.to_parquet("data.pq")

With Regular Pandas:

import time
import pandas as pd
import numpy as np

start = time.time()
df = pd.read_parquet("data.pq")
grouped = df.groupby(['index'])['qty_liter'].sum()
end = time.time()
print("computation time: ", end - start)
print(grouped.head())

output:

computation time:  19.29292106628418
index
2147400000    29.701094
2147400001    -7.164031
2147400002   -21.104117
2147400003     7.315127
2147400004   -12.661605
Name: qty_liter, dtype: float64

With Bodo:

%%px
import numpy as np
import pandas as pd
import time
import bodo

@bodo.jit(distributed=['df'])
def group_by():
    start = time.time()
    df = pd.read_parquet("data.pq")
    df = df.groupby(['index'])['qty_liter'].sum()
    end = time.time()
    print("computation time: ", end - start)
    print(df.head())
    return df

df = group_by()

output:

[stdout:0] computation time:  5.12944599299226
index
2147437531     6.975570
2147456463     1.729212
2147447371    26.358158
2147407055    -6.885663
2147454784    -5.721883
Name: qty_liter, dtype: float64

Disclaimer: I am a data scientist advocate working at Bodo.ai


I do not use strings but integer values to define the groups, and still it is very slow: about 3 minutes vs. a fraction of a second in Stata. The number of observations is about 113k, and the number of groups defined by x, y, z is about 26k.

a = df.groupby(["x", "y", "z"])["b"].describe()[['max']]

x,y,z: integer values

b: real value
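Note that describe() computes count, mean, std, min, the quartiles, and max for every group, so asking only for the maximum avoids most of that work. A minimal sketch with synthetic stand-in data (the sizes and value ranges below are made up for illustration):

import numpy as np
import pandas as pd

# Synthetic stand-in: integer group keys x, y, z and a real-valued column b
n = 113_000
df = pd.DataFrame({
    "x": np.random.randint(0, 30, n),
    "y": np.random.randint(0, 30, n),
    "z": np.random.randint(0, 30, n),
    "b": np.random.randn(n),
})

# Compute only the per-group maximum instead of the full describe() table
a = df.groupby(["x", "y", "z"])["b"].max().to_frame("max")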