Column alias after groupBy in pyspark
Question:
I need the resulting data frame in the line below to have an alias "maxDiff" for the max('diff') column after groupBy. However, the line below neither makes any change nor throws an error.
grpdf = joined_df.groupBy(temp1.datestamp).max('diff').alias("maxDiff")
Answers:
This is because you are aliasing the whole DataFrame object, not the Column. Here is an example of how to alias the Column only:
import pyspark.sql.functions as func

grpdf = (joined_df
    .groupBy(temp1.datestamp)
    .max('diff')
    .select(func.col("max(diff)").alias("maxDiff")))
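For readers who want to try this end to end, here is a minimal, self-contained sketch of the same pattern on toy data (the SparkSession setup and sample values are assumptions for illustration; it also keeps the grouping column, which the snippet above drops):

from pyspark.sql import SparkSession
import pyspark.sql.functions as func

spark = SparkSession.builder.getOrCreate()

# Toy frame with the column names from the question; the values are made up.
toy_df = spark.createDataFrame(
    [("2015-01-01", 3), ("2015-01-01", 7), ("2015-01-02", 5)],
    ["datestamp", "diff"],
)

maxes = (toy_df
    .groupBy("datestamp")
    .max("diff")
    .select("datestamp", func.col("max(diff)").alias("maxDiff")))

maxes.show()  # columns: datestamp, maxDiff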
You can use agg instead of calling the max method:
from pyspark.sql.functions import max
joined_df.groupBy(temp1.datestamp).agg(max("diff").alias("maxDiff"))
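One advantage of agg is that several aggregations can be aliased in a single pass. A small sketch under the same assumed column names; the avg aggregate is purely illustrative:

from pyspark.sql.functions import avg, max

# Alias several aggregates at once; avg("diff") is illustrative only.
summary = (joined_df
    .groupBy(temp1.datestamp)
    .agg(
        max("diff").alias("maxDiff"),
        avg("diff").alias("avgDiff"),
    ))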
Similarly in Scala
import org.apache.spark.sql.functions.max
joined_df.groupBy($"datestamp").agg(max("diff").alias("maxDiff"))
or
joined_df.groupBy($"datestamp").agg(max("diff").as("maxDiff"))
In addition to the answers already here, the following are also convenient ways if you know the name of the aggregated column and don't want to import anything from pyspark.sql.functions:
1.
grouped_df = (joined_df
    .groupBy(temp1.datestamp)
    .max('diff')
    .selectExpr('max(diff) AS maxDiff'))

See the docs for info on .selectExpr() (a variant that also keeps the grouping column is sketched after option 2).
2.
grouped_df = (joined_df
    .groupBy(temp1.datestamp)
    .max('diff')
    .withColumnRenamed('max(diff)', 'maxDiff'))

See the docs for info on .withColumnRenamed().
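As a hedged follow-up to both options: selectExpr can keep the grouping column alongside the renamed aggregate, and withColumnRenamed calls can be chained when more than one auto-generated name needs fixing. The dict form of agg below keeps things import-free; the count aggregate is illustrative only:

# Option 1 variant: keep the grouping column next to the renamed aggregate.
grouped_df = (joined_df
    .groupBy(temp1.datestamp)
    .max('diff')
    .selectExpr('datestamp', 'max(diff) AS maxDiff'))

# Option 2 variant: dict-style agg stays import-free; chain the renames.
grouped_df = (joined_df
    .groupBy(temp1.datestamp)
    .agg({'diff': 'max', 'datestamp': 'count'})
    .withColumnRenamed('max(diff)', 'maxDiff')
    .withColumnRenamed('count(datestamp)', 'numRows'))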
This answer here goes into more detail: https://stackoverflow.com/a/34077809
You can also use:

from pyspark.sql.functions import col

grouped_df = grpdf.select(col("max(diff)").alias("maxdiff"),
                          col("sum(DIFF)").alias("sumdiff"))
grouped_df.show()

Note that .show() is called separately here, because it only prints the frame and returns None, so assigning its result to grouped_df would discard the DataFrame.