Merge Spark Dataframe rows based on key

Question:

Using pivot in PySpark I was able to get the values below. Note that the columns T1..T4 are generated dynamically by the pivot, so I cannot predict whether there will be more or fewer of them.

+--------------------+-----------+----------------+-------------+-------------+
|   ID               |T1         |          T2    | T3          |        T4   |
+--------------------+-----------+----------------+-------------+-------------+
|15964021641455171213|   0.000000|             0.0|        0E-10|23.1500000000|
|15964021641455171213|  55.560000|40.7440000000002|18.5200000000|        0E-10|
+--------------------+-----------+----------------+-------------+-------------+

Expected Result:

+--------------------+-----------+----------------+-------------+-------------+
|   ID               |T1         |          T2    | T3          |        T4   |
+--------------------+-----------+----------------+-------------+-------------+
|15964021641455171213|  55.560000|40.7440000000002|18.5200000000|23.1500000000|
+--------------------+-----------+----------------+-------------+-------------+

Any help is appreciated!

Asked By: losforword


Answers:

The operation is a simple groupBy with sum as the aggregation function. The main issue is that the names and the number of columns to be summed are not known in advance, so the aggregation columns have to be built dynamically:

from pyspark.sql import functions as F

df = ...
# every column except the grouping key
non_id_cols = df.columns
non_id_cols.remove('ID')
# sum each remaining column, keeping its original name
summed_non_id_cols = [F.sum(c).alias(c) for c in non_id_cols]
df.groupBy('ID').agg(*summed_non_id_cols).show()
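
For illustration, here is a self-contained sketch of the same approach using the values from the question. The sample frame is built with plain doubles; the real pivot output looks decimal-typed, so the exact display formatting will differ.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Rebuild the pivot output from the question (as doubles, for simplicity)
df = spark.createDataFrame(
    [
        ("15964021641455171213", 0.0, 0.0, 0.0, 23.15),
        ("15964021641455171213", 55.56, 40.7440000000002, 18.52, 0.0),
    ],
    ["ID", "T1", "T2", "T3", "T4"],
)

# Derive the value columns dynamically, since T1..T4 are not known in advance
non_id_cols = [c for c in df.columns if c != "ID"]
summed = [F.sum(c).alias(c) for c in non_id_cols]

# One row per ID, each T column collapsed to its sum:
# 55.56 | 40.7440000000002 | 18.52 | 23.15
df.groupBy("ID").agg(*summed).show()

Summing works here because the complementary cells are zero; if they could instead be null, sum would still ignore them, and F.max or F.first would be reasonable alternatives.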
Answered By: werner