PySpark: Add the average as a new column to a DataFrame
Question:
I am computing the mean of a column in a DataFrame, but it resulted in all zeros. Can someone help me understand why this is happening? Below are the code and the table before and after the transformation.
Before computing the mean and adding the "mean" column
result.select("dis_price_released").show(10)
+------------------+
|dis_price_released|
+------------------+
| 0.0|
| 4.0|
| 4.0|
| 4.0|
| 1.0|
| 4.0|
| 4.0|
| 0.0|
| 4.0|
| 0.0|
+------------------+
After computing the mean and adding the "mean" column
w = Window().partitionBy("dis_price_released").rowsBetween(-sys.maxsize, sys.maxsize)
df2 = result.withColumn("mean", avg("dis_price_released").over(w))
df2.select("dis_price_released", "mean").show(10)
+------------------+----+
|dis_price_released|mean|
+------------------+----+
| 0.0| 0.0|
| 0.0| 0.0|
| 0.0| 0.0|
| 0.0| 0.0|
| 0.0| 0.0|
| 0.0| 0.0|
| 0.0| 0.0|
| 0.0| 0.0|
| 0.0| 0.0|
| 0.0| 0.0|
+------------------+----+
Answers:
You can compute the average of the whole column first, then use lit() to add it as a literal value to your DataFrame; there is no need for window functions:
from pyspark.sql.functions import avg, lit
mean = df.groupBy().avg("dis_price_released").take(1)[0][0]
df.withColumn("test", lit(mean)).show()
+------------------+----+
|dis_price_released|test|
+------------------+----+
| 0.0| 2.5|
| 4.0| 2.5|
| 4.0| 2.5|
| 4.0| 2.5|
| 1.0| 2.5|
| 4.0| 2.5|
| 4.0| 2.5|
| 0.0| 2.5|
| 4.0| 2.5|
| 0.0| 2.5|
+------------------+----+
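As a quick sanity check (plain Python rather than PySpark, using the ten values shown above), the overall mean really is 2.5, which is the value lit() then broadcasts onto every row:

```python
# The ten values from the dis_price_released column shown above
values = [0.0, 4.0, 4.0, 4.0, 1.0, 4.0, 4.0, 0.0, 4.0, 0.0]

# One global mean across all rows -- the constant lit() attaches to each row
mean = sum(values) / len(values)
print(mean)  # 2.5
```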
This is yet another way to solve the problem (note the original snippet used Scala syntax; in PySpark use alias() and index into the Row):
df.withColumn("mean", lit(df.select(avg("dis_price_released").alias("temp")).first()["temp"])).show()
The issue is that if you want the average of a column across all rows, you should not partition by any column at all, nor do you need rowsBetween in this case. Thus (assuming correct imports and the existence of the result DataFrame) your code should read:
w = Window().partitionBy()
df2 = result.withColumn("mean", avg("dis_price_released").over(w))
df2.select("dis_price_released", "mean").show(10)
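To see why the original code printed all zeros, here is a plain-Python sketch (not PySpark) of what partitionBy("dis_price_released") does: every partition contains identical values, so the per-partition mean simply equals the value itself, and show(10) happened to display rows from the 0.0 partition:

```python
from collections import defaultdict

# The ten values from the question
values = [0.0, 4.0, 4.0, 4.0, 1.0, 4.0, 4.0, 0.0, 4.0, 0.0]

# Group rows by the very column being averaged -- this mimics
# Window().partitionBy("dis_price_released")
partitions = defaultdict(list)
for v in values:
    partitions[v].append(v)

# Within each partition every value is identical, so the mean
# equals the partition key; no information is gained.
per_partition_mean = {k: sum(vs) / len(vs) for k, vs in partitions.items()}
print(per_partition_mean)  # {0.0: 0.0, 4.0: 4.0, 1.0: 1.0}
```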