most efficient method to use pandas pivot table over large file

python multithreading python-3.x pandas multiprocessing

Syntax error

Your code has a number of errors. Take this line:

pool.submit(myfunc(folder), 1000)

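The pool.submit method takes a callable as its first argument, followed by the arguments to pass to that callable.

Here myfunc(folder) is called immediately, and it is its return value that gets submitted. From what I see, your function myfunc does not return anything, and definitely not a function.

Even so, from my understanding, you are trying to launch 1000 workers that all read the same folder and then create dataframes.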
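As a minimal sketch of the intended call shape, assuming the pool is a concurrent.futures executor (the question does not show which pool is used), the function and its argument are passed separately. The body of myfunc and the folder name are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def myfunc(folder):
    # Hypothetical stand-in for the real work: return something for the folder.
    return f"processed {folder}"


folder = "data"  # hypothetical folder name

with ThreadPoolExecutor(max_workers=2) as pool:
    # Pass the function object and its argument separately;
    # writing myfunc(folder) here would run it in the main thread first.
    future = pool.submit(myfunc, folder)
    result = future.result()  # blocks until the worker has finished

print(result)
```

The second argument of submit is forwarded to the function, so it cannot be used to request a number of workers. That is set with max_workers when the executor is created.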

Parallelization problem

For CPU-bound work, the number of workers should be close to the number of cores available on the machine you are running on. This is common sense, so I will not quote anything.

Spawning 1000 workers is a lot of overhead and is a probable cause of your slow function. Also, all your workers seem to be doing the exact same thing, which of course means you do the same work 1000 times.
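One way to avoid both problems is to give each worker a different file and cap the worker count near the core count. The sketch below uses only the standard library; count_lines and run_per_file are hypothetical names, and the temporary files stand in for the real data folder. For CPU-heavy pandas work a ProcessPoolExecutor is the more usual choice, with the same map call.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def count_lines(path):
    # Hypothetical per-file task; the real one would load and aggregate the file.
    with open(path) as handle:
        return sum(1 for _ in handle)


def run_per_file(paths, task):
    # One task per file, with the worker count capped near the core count.
    workers = max(1, min(len(paths), os.cpu_count() or 1))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map keeps results in the same order as the inputs.
        return list(pool.map(task, paths))


# Usage: three small temporary files stand in for the real data folder.
with tempfile.TemporaryDirectory() as folder:
    paths = []
    for i in range(3):
        path = os.path.join(folder, f"part{i}.csv")
        with open(path, "w") as handle:
            handle.write("a,b\n" * (i + 1))
        paths.append(path)
    line_counts = run_per_file(paths, count_lines)

print(line_counts)
```

Each file is read exactly once, and the per-file results can then be combined in the main process.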

My guess at the actual pivot problem

So from what you write, code aside, I understand that you are trying to create a huge key-space that allows you to slice into any metric and drill down into the dataset.

You are doing this using a single column from what I see. You should be splitting these out into separate columns. As hinted by commenters, pandas has categorical columns that could be used, but even without them, the index for the key-space will be much smaller if the key parts are in separate columns. Your current dataset most likely has a separate key for almost every line, so no more than a few lines are ever aggregated together, which leaves the pivot table the same size as the original dataset.
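A rough sketch of that idea follows. The data, the "|" separator and the column names region, product and year are all made up for illustration, since the question's actual key format is not shown.

```python
import pandas as pd

# Made-up data: a composite key column such as "region|product|year".
df = pd.DataFrame({
    "key": ["EU|widget|2020", "EU|widget|2020", "US|widget|2020", "US|gadget|2021"],
    "value": [10, 5, 7, 3],
})

# Split the composite key into separate columns (the "|" separator is assumed).
parts = df["key"].str.split("|", expand=True)
parts.columns = ["region", "product", "year"]

# Store each key part as a categorical column and drop the original key.
df = pd.concat([df.drop(columns="key"), parts.astype("category")], axis=1)

# observed=True keeps only the key combinations that actually occur.
pivot = df.pivot_table(index=["region", "product"], columns="year",
                       values="value", aggfunc="sum", observed=True)
print(pivot)
```

With the key parts separated, rows that share a region and product are aggregated together, and each part can be sliced on its own.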

TL;DR

Split your key column into multiple columns, preferably categorical ones.
