How to find median and quantiles using Spark

Question:

How can I find the median of an RDD of integers using a distributed method, IPython, and Spark? The RDD has approximately 700,000 elements and is therefore too large to collect and find the median locally.

This question is similar to this question. However, the answer to the question is using Scala, which I do not know.

How can I calculate exact median with Apache Spark?

Following the reasoning of the Scala answer, I am trying to write a similar answer in Python.

I know I first want to sort the RDD. I do not know how. I see the sortBy (Sorts this RDD by the given keyfunc) and sortByKey (Sorts this RDD, which is assumed to consist of (key, value) pairs.) methods. I think both use key value and my RDD only has integer elements.

  1. First, I was thinking of doing myrdd.sortBy(lambda x: x)?
  2. Next I will find the length of the rdd (rdd.count()).
  3. Finally, I want to find the element or 2 elements at the center of the rdd. I need help with this method too.

EDIT:

I had an idea. Maybe I can index my RDD and then key = index and value = element. And then I can try to sort by value? I don’t know if this is possible because there is only a sortByKey method.

Asked By: pr338

Answers:

Ongoing work

SPARK-30569: Add DSL functions invoking percentile_approx

Spark 2.0+:

You can use the approxQuantile method, which implements the Greenwald-Khanna algorithm:

Python:

df.approxQuantile("x", [0.5], 0.25)

Scala:

df.stat.approxQuantile("x", Array(0.5), 0.25)

where the last parameter is the relative error. The lower the number, the more accurate the results and the more expensive the computation.
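
To get a feel for what the relative error means: the Spark documentation states that for N elements, probability p and error err, the returned value x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). Below is a minimal plain-Python sketch of that bound (no Spark needed). The helper name rank_bounds and the clamping to [0, N] are my own additions for illustration, not part of the Spark API.

```python
import math

def rank_bounds(n, p, err):
    """Return the (lowest, highest) exact rank an approximate quantile may have.

    Hypothetical helper: it only evaluates the documented bound
    floor((p - err) * n) <= rank <= ceil((p + err) * n), clamped to [0, n].
    """
    low = max(0, math.floor((p - err) * n))
    high = min(n, math.ceil((p + err) * n))
    return low, high

# With 700,000 elements and err = 0.25, the "median" may come from a very wide rank range
print(rank_bounds(700_000, 0.5, 0.25))  # (175000, 525000)
```

So with an error of 0.25 the result can land anywhere between the first and third quartile, which is worth keeping in mind before copying the value from the examples above.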

Since Spark 2.2 (SPARK-14352) it supports estimation on multiple columns:

df.approxQuantile(["x", "y", "z"], [0.5], 0.25)

and

df.approxQuantile(Array("x", "y", "z"), Array(0.5), 0.25)

The underlying methods can also be used in SQL aggregation (both global and grouped) using the approx_percentile function:

> SELECT approx_percentile(10.0, array(0.5, 0.4, 0.1), 100);
 [10.0,10.0,10.0]
> SELECT approx_percentile(10.0, 0.5, 100);
 10.0

Spark < 2.0

Python

As I’ve mentioned in the comments, it is most likely not worth all the fuss. If the data is relatively small, as in your case, then simply collect and compute the median locally:

import numpy as np

np.random.seed(323)
rdd = sc.parallelize(np.random.randint(1000000, size=700000))

%time np.median(rdd.collect())
np.array(rdd.collect()).nbytes

It takes around 0.01 seconds on my few-years-old computer and around 5.5 MB of memory.

If the data is much larger, sorting will be the limiting factor, so instead of getting an exact value it is probably better to sample, collect, and compute locally. But if you really want to use Spark, something like this should do the trick (if I didn’t mess anything up):

from numpy import floor
import time

def quantile(rdd, p, sample=None, seed=None):
    """Compute a quantile of order p ∈ [0, 1]
    :rdd a numeric rdd
    :p quantile(between 0 and 1)
    :sample fraction of an rdd to use. If not provided we use the whole dataset
    :seed random number generator seed to be used with sample
    """
    assert 0 <= p <= 1
    assert sample is None or 0 < sample <= 1

    # rdd.sample expects an integer seed, so truncate the timestamp
    seed = seed if seed is not None else int(time.time())
    rdd = rdd if sample is None else rdd.sample(False, sample, seed)

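    # Sort, attach positions, and key each element by its position.
    # Python 3 has no tuple-parameter unpacking, so index into the pair instead.
    rddSortedWithIndex = (rdd.
        sortBy(lambda x: x).
        zipWithIndex().
        map(lambda xi: (xi[1], xi[0])).
        cache())

    n = rddSortedWithIndex.count()
    h = (n - 1) * p
    lower = int(floor(h))
    upper = min(lower + 1, n - 1)  # stay in range when p == 1

    rddX, rddXPlusOne = (
        rddSortedWithIndex.lookup(i)[0] for i in (lower, upper))

    return rddX + (h - floor(h)) * (rddXPlusOne - rddX)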

And some tests:

np.median(rdd.collect()), quantile(rdd, 0.5)
## (500184.5, 500184.5)
np.percentile(rdd.collect(), 25), quantile(rdd, 0.25)
## (250506.75, 250506.75)
np.percentile(rdd.collect(), 75), quantile(rdd, 0.75)
## (750069.25, 750069.25)

Finally, let's define median:

from functools import partial
median = partial(quantile, p=0.5)

So far so good, but it takes 4.66 s in local mode without any network communication. There is probably a way to improve this, but why even bother?
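
The interpolation rule used by quantile (position h = (n - 1) * p, then a linear blend of the two neighbouring elements) can be checked locally without Spark. The following is only a sketch on a plain list; quantile_local is a name I made up for illustration, and sorted() stands in for sortBy/zipWithIndex/lookup.

```python
import math

def quantile_local(values, p):
    """Linear-interpolation quantile of a non-empty list, for p in [0, 1]."""
    assert values and 0 <= p <= 1
    ordered = sorted(values)
    n = len(ordered)
    h = (n - 1) * p                  # fractional position of the quantile
    lower = math.floor(h)
    upper = min(lower + 1, n - 1)    # avoid running past the end when p == 1
    # Blend the two neighbours according to the fractional part of h
    return ordered[lower] + (h - lower) * (ordered[upper] - ordered[lower])

print(quantile_local([1, 2, 3, 4], 0.5))            # 2.5
print(quantile_local([10, 20, 30, 40, 50], 0.25))   # 20.0
```

This is the same rule that produces the matching np.median / np.percentile values in the tests above.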

Language independent (Hive UDAF):

If you use HiveContext you can also use Hive UDAFs. With integral values:

rdd.map(lambda x: (float(x), )).toDF(["x"]).registerTempTable("df")

sqlContext.sql("SELECT percentile_approx(x, 0.5) FROM df")

With continuous values:

sqlContext.sql("SELECT percentile(x, 0.5) FROM df")

In percentile_approx you can pass an additional argument that determines the number of records to use.

Answered By: zero323

Adding a solution for those who want an RDD-only method and don't want to move to DataFrames.
This snippet gets you a percentile for an RDD of doubles.

If you input percentile as 50, you should obtain your required median.
Let me know if there are any corner cases not accounted for.

/**
  * Gets the nth percentile entry for an RDD of doubles
  *
  * @param inputScore : Input scores consisting of a RDD of doubles
  * @param percentile : The percentile cutoff required (between 0 to 100), e.g 90%ile of [1,4,5,9,19,23,44] = ~23.
  *                     It prefers the higher value when the desired quantile lies between two data points
  * @return : The number best representing the percentile in the Rdd of double
  */    
  def getRddPercentile(inputScore: RDD[Double], percentile: Double): Double = {
    val numEntries = inputScore.count().toDouble
    // Clamp to the last valid index so that percentile = 100 does not run past the end
    val retrievedEntry = (percentile * numEntries / 100.0).min(numEntries - 1).max(0).toInt


    inputScore
      .sortBy { case (score) => score }
      .zipWithIndex()
      .filter { case (score, index) => index == retrievedEntry }
      .map { case (score, index) => score }
      .collect()(0)
  }
Answered By: Vedant

Here is the method I used, based on window functions (with PySpark 2.2.0).

from pyspark.sql import DataFrame

class median():
    """ Create median class with over method to pass partition """
    def __init__(self, df, col, name):
        assert col
        self.column=col
        self.df = df
        self.name = name

    def over(self, window):
        from pyspark.sql.functions import percent_rank, pow, first

        first_window = window.orderBy(self.column)                                  # first, order by column we want to compute the median for
        df = self.df.withColumn("percent_rank", percent_rank().over(first_window))  # add percent_rank column, percent_rank = 0.5 corresponds to median
        second_window = window.orderBy(pow(df.percent_rank-0.5, 2))                 # order by (percent_rank - 0.5)^2 ascending
        return df.withColumn(self.name, first(self.column).over(second_window))     # the first row of the window corresponds to median

def addMedian(self, col, median_name):
    """ Method to be added to spark native DataFrame class """
    return median(self, col, median_name)

# Add method to DataFrame class
DataFrame.addMedian = addMedian

Then call the addMedian method to calculate the median of col2:

from pyspark.sql import Window

median_window = Window.partitionBy("col1")
df = df.addMedian("col2", "median").over(median_window)

Finally you can group by if needed.

df.groupby("col1", "median")
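
To see what this selection rule does, here is a small plain-Python sketch (not Spark code). It assumes percent_rank is (rank - 1) / (n - 1) with tied values sharing the lowest rank, and then picks the value closest to 0.5. The function name is hypothetical. Note that with an even number of rows the rule returns one of the two middle values rather than their average.

```python
def median_by_percent_rank(values):
    """Pick the value whose percent_rank is closest to 0.5 (no averaging)."""
    assert values, "need at least one value"
    ordered = sorted(values)
    n = len(ordered)
    if n == 1:
        return ordered[0]
    # Tied values share the rank of their first occurrence
    first_rank = {}
    for rank, v in enumerate(ordered, start=1):
        first_rank.setdefault(v, rank)
    # Smallest squared distance between percent_rank and 0.5 wins
    return min(ordered, key=lambda v: ((first_rank[v] - 1) / (n - 1) - 0.5) ** 2)

print(median_by_percent_rank([5, 1, 3]))  # 3
```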
Answered By: Benoît Carne

I have written a function that takes a DataFrame as input and returns a DataFrame with the median computed over a partition. order_col is the column for which we want to calculate the median, and part_col is the level (a list of column names) at which we want to calculate it:

from pyspark.sql import Window
import pyspark.sql.functions as F

def calculate_median(dataframe, part_col, order_col):
    win = Window.partitionBy(*part_col).orderBy(order_col)
#     count_row = dataframe.groupby(*part_col).distinct().count()
    dataframe.persist()
    dataframe.count()
    temp = dataframe.withColumn("rank", F.row_number().over(win))
    temp = temp.withColumn(
        "count_row_part",
        F.count(order_col).over(Window.partitionBy(part_col))
    )
    temp = temp.withColumn(
        "even_flag",
        F.when(
            F.col("count_row_part") %2 == 0,
            F.lit(1)
        ).otherwise(
            F.lit(0)
        )
    ).withColumn(
        "mid_value",
        F.floor(F.col("count_row_part")/2)
    )

    temp = temp.withColumn(
        "avg_flag",
        F.when(
            (F.col("even_flag")==1) &
            (F.col("rank") == F.col("mid_value"))|
            ((F.col("rank")-1) == F.col("mid_value")),
            F.lit(1)
        ).otherwise(
        F.when(
            F.col("rank") == F.col("mid_value")+1,
            F.lit(1)
            )
        )
    )
    temp.show(10)
    return temp.filter(
        F.col("avg_flag") == 1
    ).groupby(
        part_col + ["avg_flag"]
    ).agg(
        F.avg(F.col(order_col)).alias("median")
    ).drop("avg_flag")
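
The flagging logic boils down to choosing which row_number values get averaged: rows mid and mid + 1 for an even count, and row mid + 1 for an odd count, where mid = floor(count / 2). A plain-Python sketch of just that rule follows; the helper names are mine and no Spark is involved.

```python
def median_ranks(n):
    """1-based row_number values whose average is the median of n sorted rows."""
    assert n > 0
    mid = n // 2
    return [mid, mid + 1] if n % 2 == 0 else [mid + 1]

def median_from_ranks(values):
    """Average the rows selected by median_ranks, like the final avg() above."""
    ordered = sorted(values)
    picked = [ordered[r - 1] for r in median_ranks(len(ordered))]
    return sum(picked) / len(picked)

print(median_ranks(4), median_from_ranks([4, 1, 3, 2]))  # [2, 3] 2.5
print(median_ranks(5), median_from_ranks([5, 1, 3, 2, 4]))  # [3] 3.0
```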
Answered By: Ankit Kumar Namdeo

Two methods can be used: approxQuantile and percentile_approx. However, neither may give accurate results when there is an even number of records. The workaround is to take an average, as shown below.

import pyspark.sql.functions as F

# A single call might not give the expected result when there is an even number of records:
# df.select(F.percentile_approx("COLUMN_NAME_FOR_WHICH_MEDIAN_TO_BE_COMPUTED", 0.5).alias("MEDIAN"))

# Workaround: average the 0.5 percentile with one taken just above it
df.select(
    ((
        F.percentile_approx("COLUMN_NAME_FOR_WHICH_MEDIAN_TO_BE_COMPUTED", 0.5)
        + F.percentile_approx("COLUMN_NAME_FOR_WHICH_MEDIAN_TO_BE_COMPUTED", 0.500001)
    ) * 0.5).alias("MEDIAN")
)
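
Why the averaging helps can be sketched in plain Python under a simplifying assumption: that on small data percentile_approx behaves like a nearest-rank percentile, returning the existing element at 1-based rank ceil(p * n). That model, and both function names below, are assumptions for illustration only, not the Spark implementation.

```python
import math

def nearest_rank_percentile(values, p):
    """Assumed model: return the element at 1-based rank ceil(p * n)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]

def averaged_median(values):
    """Average the 0.5 percentile with one taken just above it."""
    return (nearest_rank_percentile(values, 0.5)
            + nearest_rank_percentile(values, 0.500001)) * 0.5

print(nearest_rank_percentile([1, 2, 3, 4], 0.5))  # 2 (lower middle element only)
print(averaged_median([1, 2, 3, 4]))               # 2.5
print(averaged_median([1, 2, 3]))                  # 2.0
```

In this model the offset 0.000001 shifts the rank by roughly n / 1,000,000, so the trick would stop picking the right neighbours once a group reaches several hundred thousand rows.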
Answered By: prashanth

For exact median computation, you can use the following function with the PySpark DataFrame API:

from typing import Union

import pyspark.sql.functions as F
from pyspark.sql import Column

def median_exact(col: Union[Column, str]) -> Column:
    """
    For grouped aggregations, Spark provides a way via pyspark.sql.functions.percentile_approx("col", .5) function,
    since for large datasets, computing the median is computationally expensive.
    This function manually computes the median and should only be used for small to mid sized datasets / groupings.
    :param col: Column to compute the median for.
    :return: A pyspark `Column` containing the median calculation expression
    """
    list_expr = F.filter(F.collect_list(col), lambda x: x.isNotNull())
    sorted_list_expr = F.sort_array(list_expr)
    size_expr = F.size(sorted_list_expr)

    even_num_elements = (size_expr % 2) == 0
    odd_num_elements = ~even_num_elements

    return F.when(size_expr == 0, None).otherwise(
        F.when(odd_num_elements, sorted_list_expr[F.floor(size_expr / 2)]).otherwise(
            (
                sorted_list_expr[(size_expr / 2 - 1).cast("long")]
                + sorted_list_expr[(size_expr / 2).cast("long")]
            )
            / 2
        )
    )

Apply it like this:

output_df = input_spark_df.groupby("group").agg(
    median_exact("elems").alias("elems_median")
)
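
For reference, the expression follows the same steps as this plain-Python sketch: drop nulls, sort, then take the middle element or the mean of the two middle elements. The name median_exact_local is made up, and None stands in for SQL NULL.

```python
def median_exact_local(values):
    """Local mirror of the column expression: filter nulls, sort, pick the middle."""
    ordered = sorted(v for v in values if v is not None)
    size = len(ordered)
    if size == 0:
        return None                   # empty group yields NULL
    if size % 2 == 1:
        return ordered[size // 2]     # odd count: single middle element
    # even count: mean of the two middle elements
    return (ordered[size // 2 - 1] + ordered[size // 2]) / 2

print(median_exact_local([3, 1, None, 2]))  # 2
print(median_exact_local([4, 1, 3, 2]))     # 2.5
print(median_exact_local([]))               # None
```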
Answered By: J. Seegler

We can calculate the median and quantiles in Spark using the following code:

df.stat.approxQuantile(col,[quantiles],error)

For example, finding the median in the following dataframe [1,2,3,4,5]:

df.stat.approxQuantile(col,[0.5],0)

The lower the error, the more accurate the results. With an error of 0, the exact quantiles are computed, which can be very expensive.

Answered By: Arun Nandam

Starting with version 3.4 (and already present in 3.3.1), the median function is directly available:
https://github.com/apache/spark/blob/e170a2eb236a376b036730b5d63371e753f1d947/python/pyspark/sql/functions.py#L633

import pyspark.sql.functions as f

df.groupBy("grp").agg(f.median("val"))

I guess the respective documentation will be added once the version is finally released.

Answered By: Jan_ewazz