Pandas-style transform of grouped data on PySpark DataFrame

python pandas apache-spark pyspark apache-spark-sql

I understand, each category requires a full scan of the DataFrame.

No, it doesn't. DataFrame aggregations are performed using logic similar to aggregateByKey (see "DataFrame groupBy behaviour/optimization"). The slower part is the join, which requires sorting and shuffling, but it still doesn't require a scan per group.
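To make the aggregateByKey-style idea concrete, here is a minimal plain-Python sketch (not Spark's actual implementation). It assumes the data is already split into partitions of (key, value) pairs; the name mean_by_key and the sample data are made up for illustration. Each partition is scanned once to build per-key (sum, count) pairs, and only those small partial results are merged afterwards.

```python
from collections import defaultdict


def mean_by_key(partitions):
    """Per-key mean: one pass over each partition, then a merge of partial results."""

    def aggregate_partition(rows):
        # key -> (running sum, count), built in a single scan of the partition
        acc = defaultdict(lambda: (0.0, 0))
        for key, value in rows:
            total, count = acc[key]
            acc[key] = (total + value, count + 1)
        return acc

    # Merge the small per-partition results; the raw rows are never rescanned.
    merged = defaultdict(lambda: (0.0, 0))
    for rows in partitions:
        for key, (total, count) in aggregate_partition(rows).items():
            merged_total, merged_count = merged[key]
            merged[key] = (merged_total + total, merged_count + count)

    return {key: total / count for key, (total, count) in merged.items()}


# Hypothetical sample data: two partitions of (Category, Values) pairs.
partitions = [
    [("a", 1.0), ("b", 10.0), ("a", 3.0)],
    [("b", 20.0), ("a", 5.0)],
]
means = mean_by_key(partitions)
print(means)  # {'a': 3.0, 'b': 15.0}
```

The number of passes over the data does not depend on how many categories there are, which is why no scan per group is needed.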

If this is the exact code you use, it is slow because you don't provide a join expression, so it simply performs a Cartesian product. That is not only inefficient but also incorrect. You want something like this:

    from pyspark.sql.functions import col

    means = df.groupBy("Category").mean("Values").alias("means")
    df.alias("df").join(means, col("df.Category") == col("means.Category"))

I think (but have not verified) that I can speed this up a great deal if I collect the result of the group-by/mean into a dictionary, and then use that dictionary in a UDF

It is possible, although performance will vary on a case-by-case basis. The problem with Python UDFs is that data has to be moved to and from the Python interpreter. Still, it is definitely worth trying. You should consider using a broadcast variable for nameToMean, though.
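A rough sketch of the dictionary-plus-UDF approach with a broadcast variable might look like the following. It is untested against any particular Spark version and rests on several assumptions: sc is an existing SparkContext, df has the Category and Values columns used above, and the set of categories is small enough to collect to the driver. The names demean and add_demeaned_column are hypothetical; the PySpark imports sit inside the function only so that the pure-Python helper can be tried without Spark installed.

```python
def demean(value, category, name_to_mean):
    """Subtract the category mean from a single value (plain Python, no Spark)."""
    return value - name_to_mean[category]


def add_demeaned_column(sc, df):
    """Return df with a DemeanedValues column computed through a Python UDF.

    Assumes `sc` is a SparkContext and `df` has Category and Values columns.
    """
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import DoubleType

    # Collect the per-category means to the driver as a plain dict.
    rows = df.groupBy("Category").mean("Values").collect()
    name_to_mean = {row[0]: row[1] for row in rows}

    # Broadcast the dict so it is shipped to each executor once
    # instead of being serialized with every task.
    name_to_mean_bc = sc.broadcast(name_to_mean)

    demean_udf = udf(
        lambda value, category: demean(value, category, name_to_mean_bc.value),
        DoubleType(),
    )
    return df.withColumn("DemeanedValues", demean_udf(col("Values"), col("Category")))
```

Every row still makes a round trip through Python here, so it is worth timing this against the join before adopting it.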

Is there an idiomatic way to express this type of operation without sacrificing performance?

In PySpark 1.6 you can use the broadcast function:

    from pyspark.sql.functions import broadcast, col

    df.alias("df").join(
        broadcast(means), col("df.Category") == col("means.Category"))

but it is not available in <= 1.5.

You can use a Window to do this, for example:

    import pyspark.sql.functions as F
    from pyspark.sql.window import Window

    window_var = Window.partitionBy('Category')
    df = df.withColumn('DemeanedValues', F.col('Values') - F.mean('Values').over(window_var))

Actually, there is an idiomatic way to do this in Spark, using the Hive OVER expression.

For example:

    df.registerTempTable('df')
    with_category_means = sqlContext.sql(
        'select *, mean(Values) OVER (PARTITION BY Category) as category_mean from df')

Under the hood, this uses a window function. I'm not sure whether this is faster than your solution, though.
