Pyspark: Add the average as a new column to DataFrame

Question:

I am computing the mean of a column in a DataFrame, but the new column ends up all zeros. Can someone explain why this is happening? Below are the code and the table before and after the transformation.

Before computing mean and adding "mean" column

result.select("dis_price_released").show(10)
+------------------+
 |dis_price_released|
 +------------------+
 |               0.0|
 |               4.0|
 |               4.0|
 |               4.0|
 |               1.0|
 |               4.0|
 |               4.0|
 |               0.0|
 |               4.0|
 |               0.0|
 +------------------+

After computing mean and adding mean column

w = Window().partitionBy("dis_price_released").rowsBetween(-sys.maxsize, sys.maxsize)
df2 = result.withColumn("mean", avg("dis_price_released").over(w))
df2.select("dis_price_released", "mean").show(10)

+------------------+----+
|dis_price_released|mean|
+------------------+----+
|               0.0| 0.0|
|               0.0| 0.0|
|               0.0| 0.0|
|               0.0| 0.0|
|               0.0| 0.0|
|               0.0| 0.0|
|               0.0| 0.0|
|               0.0| 0.0|
|               0.0| 0.0|
|               0.0| 0.0|
+------------------+----+
Asked By: thetna


Answers:

You can compute the average of the whole column first, then use lit() to add it as a constant column to your DataFrame; there is no need for window functions:

from pyspark.sql.functions import lit

mean = df.groupBy().avg("dis_price_released").take(1)[0][0]
df.withColumn("test", lit(mean)).show()
+------------------+----+
|dis_price_released|test|
+------------------+----+
|               0.0| 2.5|
|               4.0| 2.5|
|               4.0| 2.5|
|               4.0| 2.5|
|               1.0| 2.5|
|               4.0| 2.5|
|               4.0| 2.5|
|               0.0| 2.5|
|               4.0| 2.5|
|               0.0| 2.5|
+------------------+----+
Answered By: mtoto

This is yet another way to solve the problem:

from pyspark.sql.functions import avg, lit

df.withColumn("mean", lit(df.select(avg("dis_price_released").alias("temp")).first()["temp"])).show()
Answered By: Ramesh Maharjan

The issue is that if you want the average of a column across all rows, you should not partition by any column at all. Nor do you need to set rowsBetween in this case. Thus (assuming the correct imports and the existence of the result DataFrame) your code should read:

w = Window().partitionBy()
df2 = result.withColumn("mean", avg("dis_price_released").over(w))
df2.select("dis_price_released", "mean").show(10)
Answered By: Justin
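To see why the original window produced those zeros: partitioning by the very column being averaged collects identical values into each partition, so each partition's mean is just the value itself, and show(10) happened to display rows from the 0.0 partition first. A minimal plain-Python sketch of that arithmetic (no Spark needed; the sample values are taken from the question's table):

```python
from statistics import mean

# Sample of the dis_price_released column from the question.
values = [0.0, 4.0, 4.0, 4.0, 1.0, 4.0, 4.0, 0.0, 4.0, 0.0]

# partitionBy("dis_price_released") groups rows by the column's own value...
groups = {}
for v in values:
    groups.setdefault(v, []).append(v)

# ...so the per-partition mean is always the value itself.
per_partition_mean = {k: mean(vs) for k, vs in groups.items()}
print(per_partition_mean)  # {0.0: 0.0, 4.0: 4.0, 1.0: 1.0}

# The overall mean the question actually wanted:
print(mean(values))        # 2.5
```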