Get Geometric Mean Over Window in Pyspark Dataframe
Question:
I have the following pyspark dataframe
Car | Time | Val1 |
---|---|---|
1 | 1 | 3 |
2 | 1 | 6 |
3 | 1 | 8 |
1 | 2 | 10 |
2 | 2 | 21 |
3 | 2 | 33 |
I want to get the geometric mean of all the cars at each time; the resulting df should look like this:
time | geo_mean |
---|---|
1 | 5.2414827884178 |
2 | 19.065333718304 |
I know how to calculate the arithmetic average with the following code:
from pyspark.sql import functions as F
from pyspark.sql import Window as W

df = df.withColumn(
    "arithmetic_average",
    F.avg(F.col("Val1")).over(W.partitionBy("Time"))
)
But I’m unsure how to accomplish the same thing with geometric means.
Thanks in advance!
Answers:
You can try this. First get the product of all values in the same group, then take the Xth root, where X is the number of rows in the group (taking the Xth root is the same as raising to the power 1/X):
df = df.groupby('Time').agg(F.pow(F.product('Val1'), 1 / F.count('Val1')).alias('geo_mean'))
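If you want the geometric mean as a new column over a window (as in the arithmetic-average example) rather than as an aggregated DataFrame, the same idea should carry over; a minimal sketch, assuming Spark 3.2+ where F.product is available, with geo_mean as an illustrative column name:

from pyspark.sql import functions as F
from pyspark.sql import Window as W

# Product and count of Val1 within each Time partition, combined into the geometric mean
w = W.partitionBy("Time")
df = df.withColumn(
    "geo_mean",
    F.pow(F.product("Val1").over(w), 1 / F.count("Val1").over(w))
)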
Using the standard definition of the geometric mean, (x1 * x2 * ... * xn)^(1/n), might lead to very large numbers during the calculation. Using the equivalent formula exp((ln x1 + ln x2 + ... + ln xn) / n) might be better if the groups become larger:
from pyspark.sql import functions as F

(
    df.withColumn('ln_val1', F.log('Val1'))
      .groupBy('Time')
      .mean('ln_val1')
      .withColumn('geo_mean', F.exp('avg(ln_val1)'))
      .drop('avg(ln_val1)')
      .show()
)
Result:
+----+-----------------+
|Time| geo_mean|
+----+-----------------+
| 1|5.241482788417792|
| 2|19.06533371830357|
+----+-----------------+
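Since the question asks for a window rather than a groupBy, the same log/exp trick can also be written as a window expression that keeps one value per row; a minimal sketch (geo_mean is just an illustrative column name):

from pyspark.sql import functions as F
from pyspark.sql import Window as W

# exp of the windowed average of logs gives the geometric mean per Time
df = df.withColumn(
    "geo_mean",
    F.exp(F.avg(F.log("Val1")).over(W.partitionBy("Time")))
)

This form also avoids the very large intermediate products mentioned above.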