
How to find median and quantiles using Spark


Ongoing work

SPARK-30569 - Add DSL functions invoking percentile_approx
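A minimal sketch of what such a DSL call could look like from PySpark, assuming the function is exposed as pyspark.sql.functions.percentile_approx (the df, key, and x names are illustrative):

from pyspark.sql import functions as F

# Hypothetical usage once the DSL function is available;
# the third argument is the accuracy parameter
df.groupBy("key").agg(
    F.percentile_approx("x", 0.5, 10000).alias("approx_median")
)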

Spark 2.0+:

You can use the approxQuantile method, which implements the Greenwald-Khanna algorithm:

Python:

df.approxQuantile("x", [0.5], 0.25)

Scala:

df.stat.approxQuantile("x", Array(0.5), 0.25)

where the last parameter is the relative error. The lower the number, the more accurate the results and the more expensive the computation.
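For reference, a relative error of 0.0 requests exact quantiles, which is documented behavior but can be very expensive on large datasets:

# Relative error 0.0 computes exact quantiles; 0.25 above trades accuracy for speed
df.approxQuantile("x", [0.25, 0.5, 0.75], 0.0)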

Since Spark 2.2 (SPARK-14352) it supports estimation on multiple columns:

df.approxQuantile(["x", "y", "z"], [0.5], 0.25)

and

df.approxQuantile(Array("x", "y", "z"), Array(0.5), 0.25)

The underlying method can also be used in SQL aggregation (both global and grouped) using the approx_percentile function:

> SELECT approx_percentile(10.0, array(0.5, 0.4, 0.1), 100);
 [10.0,10.0,10.0]
> SELECT approx_percentile(10.0, 0.5, 100);
 10.0
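The example above is a global aggregate; a minimal sketch of the grouped form from PySpark (the df, key, and x names are illustrative):

from pyspark.sql.functions import expr

# approx_percentile per group; 100 is the accuracy parameter
df.groupBy("key").agg(
    expr("approx_percentile(x, 0.5, 100)").alias("approx_median")
)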

Spark < 2.0

Python

As I've mentioned in the comments, it is most likely not worth all the fuss. If the data is relatively small, as in your case, then simply collect it and compute the median locally:

import numpy as np

np.random.seed(323)
rdd = sc.parallelize(np.random.randint(1000000, size=700000))

%time np.median(rdd.collect())
np.array(rdd.collect()).nbytes

It takes around 0.01 second on my few-years-old computer and around 5.5 MB of memory.

If the data is much larger, sorting will be a limiting factor, so instead of getting an exact value it is probably better to sample, collect, and compute locally. But if you really want to use Spark, something like this should do the trick (if I didn't mess up anything):

import time
import numpy as np

def quantile(rdd, p, sample=None, seed=None):
    """Compute a quantile of order p ∈ [0, 1]
    :rdd a numeric rdd
    :p quantile (between 0 and 1)
    :sample fraction of an rdd to use. If not provided we use the whole dataset
    :seed random number generator seed to be used with sample
    """
    assert 0 <= p <= 1
    assert sample is None or 0 < sample <= 1

    seed = seed if seed is not None else time.time()
    rdd = rdd if sample is None else rdd.sample(False, sample, seed)

    rddSortedWithIndex = (rdd.
        sortBy(lambda x: x).
        zipWithIndex().
        map(lambda xi: (xi[1], xi[0])).  # (index, value) so we can look up by index
        cache())

    n = rddSortedWithIndex.count()
    h = (n - 1) * p

    rddX, rddXPlusOne = (
        rddSortedWithIndex.lookup(x)[0]
        for x in int(np.floor(h)) + np.array([0, 1]))

    return rddX + (h - np.floor(h)) * (rddXPlusOne - rddX)

And some tests:

np.median(rdd.collect()), quantile(rdd, 0.5)
## (500184.5, 500184.5)
np.percentile(rdd.collect(), 25), quantile(rdd, 0.25)
## (250506.75, 250506.75)
np.percentile(rdd.collect(), 75), quantile(rdd, 0.75)
## (750069.25, 750069.25)

Finally, let's define median:

from functools import partial

median = partial(quantile, p=0.5)
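A quick usage check, reusing the rdd defined above (the expected value comes from the quantile(rdd, 0.5) run shown earlier; the sampled call just illustrates the optional sample argument):

median(rdd)
## 500184.5

# Trade exactness for speed on a larger dataset by sampling first
median(rdd, sample=0.1, seed=323)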

So far so good, but it takes 4.66 s in local mode without any network communication. There is probably a way to improve this, but why even bother?

Language-independent (Hive UDAF):

If you use HiveContext, you can also use Hive UDAFs. With integral values:

rdd.map(lambda x: (float(x), )).toDF(["x"]).registerTempTable("df")

sqlContext.sql("SELECT percentile_approx(x, 0.5) FROM df")

With continuous values:

sqlContext.sql("SELECT percentile(x, 0.5) FROM df")

In percentile_approx you can pass an additional argument which determines the number of records to use.
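For example (the value 10000 here is just an illustration of that extra argument):

sqlContext.sql("SELECT percentile_approx(x, 0.5, 10000) FROM df")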


Here is the method I used, based on window functions (with PySpark 2.2.0).

from pyspark.sql import DataFrame
from pyspark.sql.functions import percent_rank, pow, first


class median():
    """ Create median class with over method to pass partition """
    def __init__(self, df, col, name):
        assert col
        self.column = col
        self.df = df
        self.name = name

    def over(self, window):
        first_window = window.orderBy(self.column)                                  # first, order by the column we want to compute the median for
        df = self.df.withColumn("percent_rank", percent_rank().over(first_window))  # add percent_rank column; percent_rank = 0.5 corresponds to the median
        second_window = window.orderBy(pow(df.percent_rank - 0.5, 2))               # order by (percent_rank - 0.5)^2 ascending
        return df.withColumn(self.name, first(self.column).over(second_window))     # the first row of the window corresponds to the median


def addMedian(self, col, median_name):
    """ Method to be added to spark native DataFrame class """
    return median(self, col, median_name)


# Add method to DataFrame class
DataFrame.addMedian = addMedian

Then call the addMedian method to calculate the median of col2:

from pyspark.sql import Window

median_window = Window.partitionBy("col1")

df = df.addMedian("col2", "median").over(median_window)

Finally, you can group by if needed:

df.groupby("col1", "median")
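Since every row in a partition carries the same median value after the step above, a minimal sketch of collapsing to one row per group (assuming the col1/median names used here):

from pyspark.sql import functions as F

# One median row per col1 group; every row in the group holds the same value
df.groupBy("col1").agg(F.first("median").alias("median"))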


Adding a solution if you want an RDD method only and don't want to move to DataFrames. This snippet can get you a percentile for an RDD of doubles.

If you input the percentile as 50, you should obtain your required median. Let me know if there are any corner cases not accounted for.

/**
  * Gets the nth percentile entry for an RDD of doubles
  *
  * @param inputScore : Input scores consisting of a RDD of doubles
  * @param percentile : The percentile cutoff required (between 0 to 100), e.g 90%ile of [1,4,5,9,19,23,44] = ~23.
  *                     It prefers the higher value when the desired quantile lies between two data points
  * @return : The number best representing the percentile in the Rdd of double
  */
def getRddPercentile(inputScore: RDD[Double], percentile: Double): Double = {
    val numEntries = inputScore.count().toDouble
    val retrievedEntry = (percentile * numEntries / 100.0).min(numEntries).max(0).toInt

    inputScore
      .sortBy { case (score) => score }
      .zipWithIndex()
      .filter { case (score, index) => index == retrievedEntry }
      .map { case (score, index) => score }
      .collect()(0)
}