Get Geometric Mean Over Window in Pyspark Dataframe


I have the following pyspark dataframe

Car Time Val1
1 1 3
2 1 6
3 1 8
1 2 10
2 2 21
3 2 33

I want to get the geometric mean of Val1 across all the cars at each time. The resulting df should look like this:

time geo_mean
1 5.2414827884178
2 19.065333718304

I know how to calculate the arithmetic average with the following code:

from pyspark.sql import functions as F

df = df.withColumn(

But I’m unsure how to accomplish the same thing with geometric means.

Thanks in advance!

Asked By: DataScience99



You can try this. First get the product of all values in the same group, then take the Xth root, where X is the number of rows in that group. The Xth root is the same as raising to the power 1/X.

df = df.groupby('Time').agg(F.pow(F.product('Val1'), 1/F.count('Val1')))
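
As a quick sanity check of the product-then-root idea, here is a minimal plain-Python sketch (no Spark involved) that uses the values from the question. The `groups` dict and the `geo_mean_product` helper are names made up for this illustration only.

```python
import math

# Values from the question, grouped by Time (plain Python, no Spark needed).
groups = {1: [3, 6, 8], 2: [10, 21, 33]}

def geo_mean_product(values):
    """Geometric mean as the n-th root of the product of the values."""
    return math.prod(values) ** (1 / len(values))

result = {time: geo_mean_product(vals) for time, vals in groups.items()}
print(result)
```

The two results should agree with the expected geo_mean values in the question up to floating-point rounding.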


Answered By: Emma

Using the standard definition of the geometric mean might lead to very large numbers during the calculation.
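
To illustrate the concern, here is a small plain-Python sketch with a made-up group of 400 rows that all hold the value 10.0. It assumes the product is accumulated in double precision, which is what Python floats use. The intermediate product overflows to infinity, while the logarithm-based form stays finite.

```python
import math

# Hypothetical group: 400 rows that all hold the value 10.0.
values = [10.0] * 400

# Direct definition: the product 10**400 exceeds the float range and becomes inf.
product = math.prod(values)
direct = product ** (1 / len(values))

# Log form: exp(mean(ln(x))) never builds the huge intermediate product.
log_form = math.exp(sum(math.log(v) for v in values) / len(values))

print(direct, log_form)
```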

Using the equivalent formula geo_mean = exp((ln(x_1) + ... + ln(x_n)) / n), that is, the exponential of the arithmetic mean of the logarithms, might be better if the groups become larger:

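from pyspark.sql import functions as F

# Take ln(Val1), average it per Time, then exponentiate the average.
(df.withColumn('ln_val1', F.log('Val1'))
    .groupBy('Time')
    .mean('ln_val1')   # produces the column 'avg(ln_val1)'
    .withColumn('geo_mean', F.exp('avg(ln_val1)'))
    .drop('avg(ln_val1)')
    .show())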


+----+-----------------+
|Time|         geo_mean|
+----+-----------------+
|   1|5.241482788417792|
|   2|19.06533371830357|
+----+-----------------+
Answered By: werner
