# Get Geometric Mean Over Window in Pyspark Dataframe

## Question:

I have the following PySpark dataframe:

| Car | Time | Val1 |
|-----|------|------|
| 1   | 1    | 3    |
| 2   | 1    | 6    |
| 3   | 1    | 8    |
| 1   | 2    | 10   |
| 2   | 2    | 21   |
| 3   | 2    | 33   |

I want to get the geometric mean of all the cars at each time; the resulting df should look like this:

| Time | geo_mean        |
|------|-----------------|
| 1    | 5.2414827884178 |
| 2    | 19.065333718304 |

I know how to calculate the arithmetic average with the following code:

```python
from pyspark.sql import functions as F
from pyspark.sql import Window as W

df = df.withColumn(
    "arithmetic_average",
    F.avg(F.col("Val1")).over(W.partitionBy("Time"))
)
```

But I’m unsure how to accomplish the same thing with geometric means.

## Answers:

You can try this: first take the product of all values in the same group, then take the Xth root, where X is the number of rows in that group. The Xth root is the same as raising to the power `1/X` (note that `F.product` requires Spark 3.1+):

```python
df = df.groupby('Time').agg(
    F.pow(F.product('Val1'), 1 / F.count('Val1')).alias('geo_mean')
)
```
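As a quick sanity check, the same product-and-root formula can be evaluated in plain Python against the sample data from the question (no Spark session needed):

```python
import math

# Sample data from the question, grouped by Time: {Time: [Val1, ...]}
groups = {1: [3, 6, 8], 2: [10, 21, 33]}

# Geometric mean as the Nth root of the product, mirroring
# F.pow(F.product('Val1'), 1 / F.count('Val1')) above.
geo_means = {t: math.prod(vals) ** (1 / len(vals)) for t, vals in groups.items()}

print(geo_means)  # {1: 5.24148..., 2: 19.06533...}
```

The values match the expected output in the question.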

Using the standard definition of the geometric mean can produce very large intermediate products. For larger groups, the equivalent formula, the exponential of the arithmetic mean of the logarithms, is numerically safer:

```python
from pyspark.sql import functions as F

(
    df.withColumn('ln_val1', F.log('Val1'))
      .groupBy('Time')
      .mean('ln_val1')
      .withColumn('geo_mean', F.exp('avg(ln_val1)'))
      .drop('avg(ln_val1)')
      .show()
)
```

Result:

```
+----+-----------------+
|Time|         geo_mean|
+----+-----------------+
|   1|5.241482788417792|
|   2|19.06533371830357|
+----+-----------------+
```
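The equivalence `exp(mean(log x)) == (prod x) ** (1/n)` and the stability gain can be illustrated in plain Python, again using data from the question:

```python
import math

vals = [10, 21, 33]  # the Time == 2 group from the question

# Direct definition: Nth root of the product.
direct = math.prod(vals) ** (1 / len(vals))

# Log-based equivalent: exp of the arithmetic mean of the logs.
stable = math.exp(sum(math.log(v) for v in vals) / len(vals))

print(direct, stable)  # both ~19.0653...

# For large inputs the raw product overflows to float('inf'),
# while the sum of logs stays small and well-behaved:
big = [1e300] * 10
overflowed = math.prod(big) ** (1 / len(big))  # inf
log_based = math.exp(sum(math.log(v) for v in big) / len(big))  # ~1e300
print(overflowed, log_based)
```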