Strategy for partitioning dask dataframes efficiently

As of Dask 2.0.0 you may call .repartition(partition_size="100MB").

This method performs an object-considerate (.memory_usage(deep=True)) breakdown of partition size. It will join smaller partitions, or split partitions that have grown too large.

The Dask documentation also describes this usage.

python optimization dataframe dask

After discussion with mrocklin, a decent strategy for partitioning is to aim for partitions of roughly 100MB, using df.memory_usage().sum().compute() to estimate the total size of the dataframe. For datasets that fit in RAM, the extra computation this involves can be mitigated by calling df.persist() at relevant points.


Just to add to Samantha Hughes' answer:

memory_usage() by default ignores memory consumption of object dtype columns. For the datasets I have been working with recently this leads to an underestimate of memory usage of about 10x.

Unless you are sure there are no object dtype columns I would suggest specifying deep=True, that is, repartition using:

df.repartition(npartitions=1 + df.memory_usage(deep=True).sum().compute() // n)

Where n is your target partition size in bytes. Adding 1 ensures the number of partitions is always at least 1 (// performs floor division, so the quotient alone would be 0 whenever the dataframe is smaller than n).
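The same calculation can be wrapped in a small helper. This is only a sketch: the names npartitions_for and repartition_by_size, and the 100,000,000-byte default, are hypothetical choices for illustration and not part of Dask.

```python
def npartitions_for(total_bytes, target_bytes):
    """Partition count so that each partition holds at most about target_bytes.

    Floor division alone gives 0 for small inputs, so 1 is added to
    guarantee at least one partition.
    """
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    return 1 + int(total_bytes) // int(target_bytes)


def repartition_by_size(df, target_bytes=100_000_000):
    """Repartition a dask DataFrame `df` toward `target_bytes` per partition.

    deep=True makes memory_usage count the contents of object dtype
    columns; .compute() is needed because the sum is lazy.
    """
    total_bytes = df.memory_usage(deep=True).sum().compute()
    return df.repartition(npartitions=npartitions_for(total_bytes, target_bytes))


# The arithmetic on its own, with made-up sizes:
print(npartitions_for(0, 100))    # smaller than the target: still one partition
print(npartitions_for(250, 100))  # 250 // 100 == 2, plus 1
```

Note that when the total is an exact multiple of the target, the formula yields one partition more than strictly necessary, which is harmless for this purpose.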
