Get Geometric Mean Over Window in Pyspark Dataframe

Question:

I have the following pyspark dataframe

Car  Time  Val1
  1     1     3
  2     1     6
  3     1     8
  1     2    10
  2     2    21
  3     2    33

I want to get the geometric mean of all the cars at each time, resulting df should look like this:

time  geo_mean
   1  5.2414827884178
   2  19.065333718304

I know how to calculate the arithmetic average with the following code:


from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

df = df.withColumn(
        "arithmetic_average",
        F.avg(F.col("Val1")).over(W.partitionBy("time"))
    )

But I’m unsure how to accomplish the same thing with geometric means.

Thanks in advance!

Asked By: DataScience99


Answers:

You can try this. First take the product of all values in the same group, then take the Xth root, where X is the number of rows in that group. The Xth root is the same as raising to the power 1/X. (Note that `F.product` is available from Spark 3.1.)

df = df.groupby('Time').agg(
    F.pow(F.product('Val1'), 1 / F.count('Val1')).alias('geo_mean')
)

ref: https://www.mathsisfun.com/numbers/geometric-mean.html
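As a quick sanity check, the same product-then-root formula can be reproduced in plain Python (no Spark needed) against the values from the question:

```python
import math

# Val1 values grouped by Time, taken from the question's dataframe
groups = {1: [3, 6, 8], 2: [10, 21, 33]}

for time, vals in groups.items():
    # Xth root of the product, where X is the group size
    geo_mean = math.prod(vals) ** (1 / len(vals))
    print(time, geo_mean)
```

This prints values matching the expected output in the question (5.2414… for time 1 and 19.0653… for time 2).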

Answered By: Emma

Using the standard definition of the geometric mean (the Xth root of the product) might lead to very large intermediate numbers during the calculation.

Using the equivalent formula geo_mean = exp(mean(ln(Val1))) might be better if the groups become larger:

from pyspark.sql import functions as F

df.withColumn('ln_val1', F.log('Val1')) \
    .groupBy('Time') \
    .mean('ln_val1') \
    .withColumn('geo_mean', F.exp('avg(ln_val1)')) \
    .drop('avg(ln_val1)') \
    .show()

Result:

+----+-----------------+
|Time|         geo_mean|
+----+-----------------+
|   1|5.241482788417792|
|   2|19.06533371830357|
+----+-----------------+
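The log-average-exp form used above can be checked the same way in plain Python (`geo_mean_log` is a hypothetical helper for illustration, not part of the answer):

```python
import math

def geo_mean_log(vals):
    # exp of the arithmetic mean of the logs; equivalent to the Xth root
    # of the product, but the intermediate values stay small
    return math.exp(sum(math.log(v) for v in vals) / len(vals))

print(geo_mean_log([3, 6, 8]))     # group Time == 1
print(geo_mean_log([10, 21, 33]))  # group Time == 2
```

The practical benefit is that summing logs cannot overflow the way a long running product can, which is why this form scales better to large groups.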
Answered By: werner