Merge Spark Dataframe rows based on key

Question:

Using pivot in PySpark I was able to get the values below. Note that the columns T1..T4 are generated dynamically by the pivot, so I cannot predict whether there will be more or fewer of them.

+--------------------+-----------+----------------+-------------+-------------+
|   ID               |T1         |          T2    | T3          |        T4   |
+--------------------+-----------+----------------+-------------+-------------+
|15964021641455171213|   0.000000|             0.0|        0E-10|23.1500000000|
|15964021641455171213|  55.560000|40.7440000000002|18.5200000000|        0E-10|
+--------------------+-----------+----------------+-------------+-------------+

Expected Result:

+--------------------+-----------+----------------+-------------+-------------+
|   ID               |T1         |          T2    | T3          |        T4   |
+--------------------+-----------+----------------+-------------+-------------+
|15964021641455171213|  55.560000|40.7440000000002|18.5200000000|23.1500000000|
+--------------------+-----------+----------------+-------------+-------------+

Any help is appreciated!

Asked By: losforword


Answers:

The operation is a simple groupBy with sum as the aggregation function. The main issue is that the names and the number of columns to be summed are not known in advance, so the aggregation columns have to be built dynamically:

from pyspark.sql import functions as F

df = ...
# every column except the grouping key
non_id_cols = df.columns
non_id_cols.remove('ID')
# sum each remaining column, keeping its original name
summed_non_id_cols = [F.sum(c).alias(c) for c in non_id_cols]
df.groupBy('ID').agg(*summed_non_id_cols).show()
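
For illustration, here is a self-contained sketch of the same approach using the values from the question. The sample frame is built with plain doubles; the real pivot output looks decimal-typed, so the exact display formatting will differ.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Rebuild the pivot output from the question (as doubles, for simplicity)
df = spark.createDataFrame(
    [
        ("15964021641455171213", 0.0, 0.0, 0.0, 23.15),
        ("15964021641455171213", 55.56, 40.7440000000002, 18.52, 0.0),
    ],
    ["ID", "T1", "T2", "T3", "T4"],
)

# Derive the value columns dynamically, since T1..T4 are not known in advance
non_id_cols = [c for c in df.columns if c != "ID"]
summed = [F.sum(c).alias(c) for c in non_id_cols]

# One row per ID, each T column collapsed to its sum:
# 55.56 | 40.7440000000002 | 18.52 | 23.15
df.groupBy("ID").agg(*summed).show()

Summing works here because the complementary cells are zero; if they could instead be null, sum would still ignore them, and F.max or F.first would be reasonable alternatives.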
Answered By: werner