
Applying Python function to Pandas grouped DataFrame - what's the most efficient approach to speed up the computations?


Q : "Could either of my approaches be adjusted to eliminate possible bottlenecks and improve the performance? ( e.g. PySpark setup, adjusting sub-optimal operations etc. )"

+1 for mentioning the setup add-on overhead costs for either strategy of computing. These always create a break-even point, only after which a non-[SERIAL] strategy may deliver any of the wished-to-have [TIME]-Domain speedup ( provided that other, typically [SPACE]-Domain, costs permit or stay feasible - yes, RAM: the existence of and access to a device of such size, the budget and other similar real-world constraints ).

First,
the pre-flight check, before we take off

The new, overhead-strict formulation of Amdahl's Law is able to incorporate both of these add-on pSO + pTO overheads and reflects them in predicting the achievable Speedup levels, including the break-even point, beyond which it may become meaningful ( in a costs/effects, efficiency sense ) to go parallel.

( image: the overhead-strict re-formulation of Amdahl's Law )
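
( A reader's side-note, not part of the original answer: a minimal Python sketch of such an overhead-strict speedup model, assuming the common form S = 1 / ( ( 1 - p ) + pSO + p / N + pTO ); the function name and the example numbers are illustrative only. )

def overhead_strict_speedup(p, N, pSO, pTO):
    # Achievable speedup once the add-on setup / teardown overheads are paid:
    #   S = 1 / ( ( 1 - p ) + pSO + p / N + pTO )
    # p   ... parallelisable fraction of the original [SERIAL] run-time
    # N   ... number of workers
    # pSO ... add-on setup    overhead, as a fraction of that run-time
    # pTO ... add-on teardown overhead, as a fraction of that run-time
    return 1.0 / ((1.0 - p) + pSO + p / N + pTO)

# e.g. 95 % parallelisable work on 8 workers, with 10 % setup + 2 % teardown
# overheads: ~3.5x instead of the naive-Amdahl ~5.9x
print(overhead_strict_speedup(p=0.95, N=8, pSO=0.10, pTO=0.02))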

Yet,
that is not our core problem here.
This comes next :

Next,
given the computational costs of SpectralClustering(), which is going to use here the Radial Basis Function ( RBF ) kernel ~ exp( -gamma * distance( data, data )**2 ), there seems to be no gain from splitting the data-object over any number of disjunct work-units, as the distance( data, data )-component, by definition, has to visit all the data-elements ( ref. the communication costs of any-to-any value-passing in { process | node }-distributed topologies are, for obvious reasons, awfully bad, if not the worst use-cases for { process | node }-distributed processing, if not straight anti-patterns ( except for some indeed arcane, memory-less / state-less computing fabrics ) ).
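
( A back-of-envelope sketch, not from the original answer, just to make the any-to-any scale tangible; it re-uses the N ~ 1.12E6 figure quoted below and assumes a dense float64 affinity matrix. )

N = int(1.12e6)                          # ~ len( data ), figure quoted below
pairs = N * (N - 1) // 2                 # any-to-any distance evaluations
dense_affinity_bytes = N * N * 8         # full N x N float64 kernel matrix

print(f"~{pairs:.2e} pairwise distances")               # ~ 6.3e+11 evaluations
print(f"~{dense_affinity_bytes / 1e12:.1f} TB of RAM")  # ~ 10.0 TB just to hold it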

For pedantic analysts, yes - add to this ( an already bad state, we may say ) the costs of the - again - any-to-any k-means processing, here about O( N^( 1 + 5 * 5 ) ), which, for N ~ len( data ) ~ 1.12E6+, goes awfully against our wish to have some smart and fast processing.
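
( Again only a sketch, to put the quoted worst-case bound into an order-of-magnitude perspective. )

import math

N = 1.12e6
exponent = 1 + 5 * 5                     # the quoted 26
print(f"O( N^{exponent} ) ~ 10^{exponent * math.log10(N):.0f} operations")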

So what?

While the setup costs are not to be neglected, the increased communication costs will almost for sure disable any improvement from the above-sketched attempts to move from a pure-[SERIAL] process flow into some form of just-[CONCURRENT] or True-[PARALLEL] orchestration of work-sub-units, due to the increased overheads of having to implement ( a tandem pair of ) any-to-any value-passing topologies.

If it weren't for 'em?

Well, this sounds like a Computing Science oxymoron - even if it were possible, the any-to-any pre-computed distances ( which would pay those immense [TIME]-Domain complexity costs "beforehand" ( Where? How? Is there any other, un-avoidable latency permitting a possible latency-masking by some ( so far unknown ) incremental build-up of a complete-in-future any-to-any distance matrix? ) ) would but reposition these principally present costs to some other location in the [TIME]- and [SPACE]-Domains, not reduce 'em.

Q : "Are they any better alternatives?"

The only one I am aware of so far is to try whether the problem can be re-formulated into another, QUBO-formulated, problem fashion ( ref.: Quadratic-Unconstrained-Binary-Optimisation; the good news is that tools for doing so, a base of first-hand knowledge and practical problem-solving experience exist and keep growing )
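
( For readers new to the term, a toy illustration of what a QUBO instance is - minimising x^T Q x over binary vectors x - not the clustering re-formulation itself; the tiny Q matrix below is made up. )

import itertools
import numpy as np

Q = np.array([[-1.0,  2.0],              # a tiny, hand-made QUBO matrix
              [ 0.0, -1.0]])

# brute-force the 2^2 binary assignments, picking the lowest "energy" x.T @ Q @ x
best = min(itertools.product([0, 1], repeat=2),
           key=lambda x: np.array(x) @ Q @ np.array(x))
print(best)                              # -> (0, 1), energy -1.0 ( tied with (1, 0) )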

Q : "How do they compare to the provided solutions in terms of performance?"

The performance is breathtaking - a QUBO-formulated problem has a promising O(1) (!) solver working in constant time ( in the [TIME]-Domain ), though somewhat restricted in the [SPACE]-Domain ( where recently announced LLNL tricks may help avoid this physical-world, current-QPU-implementation constraint on problem sizes ).


This is not an answer, but...

If you run

df.groupby(['measurement_id', 'time', 'group']).apply(
    lambda x: cluster(x['var'], x['object']))

(i.e., with Pandas alone), you will notice that you are already using several cores. This is because sklearn uses joblib by default to parallelise the work. You can swap out the scheduler in favour of Dask and perhaps gain some efficiency in sharing the data between threads, but so long as the work you are doing is CPU-bound like this, there will be nothing you can do to speed it up.
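
( A minimal sketch of the scheduler swap mentioned above, assuming a local dask.distributed setup and that df and cluster() are defined as in the question; how much of the inner work actually gets dispatched through joblib depends on the sklearn estimator and its n_jobs settings. )

import joblib
from dask.distributed import Client

client = Client(processes=False)        # the 'dask' joblib backend needs a running Client

with joblib.parallel_backend('dask'):   # route joblib-parallelised work to the Dask workers
    res = df.groupby(['measurement_id', 'time', 'group']).apply(
        lambda x: cluster(x['var'], x['object']))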

In short, this is an algorithm problem: figure out what you really need to compute, before trying to consider different frameworks for computing it.


I am not an expert on Dask, but I provide the following code as a baseline:

import dask.dataframe as ddf
import numpy as np
import pandas as pd

# `df` ( the pandas DataFrame ) and `cluster` are defined as in the question
df = ddf.from_pandas(df, npartitions=4)  # My PC has 4 cores

task = df.groupby(["measurement_id", "time", "group"]).apply(
    lambda x: cluster(x["var"], x["object"]),
    meta=pd.Series(np.nan, index=pd.Series([0, 1, 1, 1])),
)

res = task.compute()
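
( A side note, not from the original answer: the meta= argument above only declares the expected output schema for Dask and does not change the computation itself. If cluster() is CPU-bound and holds the GIL, it may be worth trying a process-based scheduler, e.g. res = task.compute(scheduler="processes"), at the price of pickling each group. )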