
Pandas: df.groupby() is too slow for big data set. Any alternative methods?


The problem is that your data are not numeric. Processing strings takes a lot longer than processing numbers. Try this first:

df.index = df.index.astype(int)
df.qty_liter = df.qty_liter.astype(float)

Then do groupby() again. It should be much faster. If it is, see if you can modify your data loading step to have the proper dtypes from the beginning.
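For example, if the data comes from a CSV file, the dtypes can be specified at load time instead of being converted afterwards. This is only a minimal sketch; the file name and the read_csv call are assumptions about how your data is actually loaded:

import pandas as pd

# Hypothetical loading step: the dtype mapping makes the group key and the
# aggregated column numeric from the start
df = pd.read_csv(
    "data.csv",
    dtype={"index": "int64", "qty_liter": "float64"},
)
df = df.set_index("index")

# Grouping by the (now integer) index should be much faster
grouped = df.groupby(level=0)["qty_liter"].sum()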


Your data has a very large number of distinct groups (a high-cardinality group key), which is the main reason the groupby is so slow. I tried Bodo to see how it would handle the groupby on a large data set. I ran the code with regular sequential Pandas and with parallelized Bodo: it took about 20 seconds with Pandas and only 5 seconds with Bodo. Bodo basically parallelizes your Pandas code automatically and lets you run it on multiple processors, which you cannot do with native Pandas. It is free for up to four cores: https://docs.bodo.ai/latest/source/install.html

Notes on data generation: I generated a relatively large dataset with 20 million rows and 18 numerical columns. To make it more similar to your dataset, two additional columns named "index" and "qty_liter" are included.

# data generation
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(20000000, 18), columns=list('ABCDEFGHIJKLMNOPQR'))
df['index'] = np.random.randint(2147400000, 2147500000, 20000000).astype(str)
df['qty_liter'] = np.random.randn(20000000)
df.to_parquet("data.pq")

With Regular Pandas:

import time
import pandas as pd
import numpy as np

start = time.time()
df = pd.read_parquet("data.pq")
grouped = df.groupby(['index'])['qty_liter'].sum()
end = time.time()
print("computation time: ", end - start)
print(grouped.head())

output:

computation time:  19.29292106628418
index
2147400000    29.701094
2147400001    -7.164031
2147400002   -21.104117
2147400003     7.315127
2147400004   -12.661605
Name: qty_liter, dtype: float64

With Bodo:

%%px
import numpy as np
import pandas as pd
import time
import bodo

@bodo.jit(distributed=['df'])
def group_by():
    start = time.time()
    df = pd.read_parquet("data.pq")
    df = df.groupby(['index'])['qty_liter'].sum()
    end = time.time()
    print("computation time: ", end - start)
    print(df.head())
    return df

df = group_by()

output:

[stdout:0] computation time:  5.12944599299226
index
2147437531     6.975570
2147456463     1.729212
2147447371    26.358158
2147407055    -6.885663
2147454784    -5.721883
Name: qty_liter, dtype: float64

Disclaimer: I am a data scientist advocate working at Bodo.ai


I do not use strings but integer values to define the groups, and still it is very slow: about 3 minutes vs. a fraction of a second in Stata. The number of observations is about 113k, and the number of groups defined by x, y, z is about 26k.

a = df.groupby(["x", "y", "z"])["b"].describe()[['max']]

x,y,z: integer values

b: real value
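Note that describe() computes count, mean, std, min, the quartiles, and max for every group, so asking only for the maximum avoids most of that work. A minimal sketch with synthetic stand-in data (the sizes and value ranges below are made up for illustration):

import numpy as np
import pandas as pd

# Synthetic stand-in: integer group keys x, y, z and a real-valued column b
n = 113_000
df = pd.DataFrame({
    "x": np.random.randint(0, 30, n),
    "y": np.random.randint(0, 30, n),
    "z": np.random.randint(0, 30, n),
    "b": np.random.randn(n),
})

# Compute only the per-group maximum instead of the full describe() table
a = df.groupby(["x", "y", "z"])["b"].max().to_frame("max")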