most efficient method to use pandas pivot table over large file most efficient method to use pandas pivot table over large file python-3.x python-3.x

most efficient method to use pandas pivot table over large file


Syntax error

Your code has a number of syntactical errors

pool.submit(myfunc(folder), 1000)

The pool.submit method takes a function as a first argument.

From what I see your function myfunc does not return anything, and definitely not a function.

Even so, from my understanding, you are trying to launch 1000 workers who all read the same folder and then creates dataframes.

Parallelization problem

In any threading scenario, the number of workers should be close to the number of cores available on the machine you are running. This is common sense, I will not quote anything.

Spawning 1000 workers is a lot of overhead and is a probable source of your slow function. Also all your workers seem to be doing the exact same thing, which of course means you do the same work 1000 times.

My guess at the actual pivot problem

So from what you write, code aside, I understand that you are trying to create a huge key-space that allows you to slice into any metric and drill down into the dataset.

You are doing this using a single column from what I see. You should be splitting these out into separate columns. As hinted by commenters, pandas has categorical columns that could be used, but even without them, the index for the key-space will be much smaller if the key parts are in separate columns. Your current dataset most likely has a separate key for almost each line, thus not aggregating more than a a few lines together, leaving the pivot table the same size as the original dataset.

TLDR;

Split your key column into multiple columns, preferably categorical ones.