How to use window functions in PySpark?


To be able to use window functions you have to create a window first. The definition is pretty much the same as in standard SQL: you can define an ordering, a partitioning, or both. First, let's create some dummy data:

import numpy as np

np.random.seed(1)

keys = ["foo"] * 10 + ["bar"] * 10
values = np.hstack([np.random.normal(0, 1, 10), np.random.normal(10, 1, 10)])

df = sqlContext.createDataFrame([
    {"k": k, "v": round(float(v), 3)} for k, v in zip(keys, values)
])

Make sure you're using HiveContext (Spark < 2.0 only):

from pyspark.sql import HiveContext

assert isinstance(sqlContext, HiveContext)

Create a window:

from pyspark.sql.window import Window

w = Window.partitionBy(df.k).orderBy(df.v)

which is equivalent to

(PARTITION BY k ORDER BY v) 

in SQL.
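To make the correspondence concrete, here is a minimal sketch of the same window used from raw Spark SQL. The temporary table name df is an assumption made for this example, and registerTempTable is the Spark < 2.0 API (use createOrReplaceTempView in 2.0+):

df.registerTempTable("df")  # hypothetical view name, for illustration only

sqlContext.sql("""
    SELECT k, v,
           PERCENT_RANK() OVER (PARTITION BY k ORDER BY v) AS percent_rank
    FROM df
""")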

As a rule of thumb, window definitions should always contain a PARTITION BY clause; otherwise Spark will move all data to a single partition. ORDER BY is required for some functions, while in other cases (typically aggregates) it may be optional, as in the sketch below.
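For example, a plain aggregate such as avg works over a window with just a partitioning. This sketch reuses the df defined above; w_unordered is a name introduced here for illustration:

from pyspark.sql.functions import avg

# No ORDER BY: avg is computed over each partition as a whole.
w_unordered = Window.partitionBy(df.k)

df.select("k", "v", avg(df.v).over(w_unordered).alias("group_avg"))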

There are also two optional clauses which can be used to define the window span: ROWS BETWEEN and RANGE BETWEEN. These won't be useful for us in this particular scenario.
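For completeness, a frame could be defined along these lines. This is only a sketch of a moving average over the two preceding rows and the current row; w_frame is an illustrative name:

from pyspark.sql.functions import avg

# Frame covering the two preceding rows and the current row.
w_frame = Window.partitionBy(df.k).orderBy(df.v).rowsBetween(-2, 0)

df.select("k", "v", avg(df.v).over(w_frame).alias("moving_avg"))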

Finally, we can use it in a query:

from pyspark.sql.functions import percent_rank, ntile

df.select(
    "k", "v",
    # percent_rank was called percentRank in Spark < 1.6
    percent_rank().over(w).alias("percent_rank"),
    ntile(3).over(w).alias("ntile3")
)

Note that ntile is not related in any way to quantiles; it simply assigns each row to one of n roughly equal-sized buckets by rank.
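If what you actually need are quantile values, in Spark >= 2.0 you could use approxQuantile instead; this sketch picks an arbitrary relative error of 0.01:

# Returns approximate quantile values of v, not bucket assignments.
quantiles = df.approxQuantile("v", [0.25, 0.5, 0.75], 0.01)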