How to pivot on multiple columns in Spark SQL?

Question:

I need to pivot more than one column in a PySpark dataframe. Sample dataframe:

from pyspark.sql import functions as F
d = [(100,1,23,10),(100,2,45,11),(100,3,67,12),(100,4,78,13),(101,1,23,10),(101,2,45,13),(101,3,67,14),(101,4,78,15),(102,1,23,10),(102,2,45,11),(102,3,67,16),(102,4,78,18)]
mydf = spark.createDataFrame(d,['id','day','price','units'])
mydf.show()
# +---+---+-----+-----+
# | id|day|price|units|
# +---+---+-----+-----+
# |100|  1|   23|   10|
# |100|  2|   45|   11|
# |100|  3|   67|   12|
# |100|  4|   78|   13|
# |101|  1|   23|   10|
# |101|  2|   45|   13|
# |101|  3|   67|   14|
# |101|  4|   78|   15|
# |102|  1|   23|   10|
# |102|  2|   45|   11|
# |102|  3|   67|   16|
# |102|  4|   78|   18|
# +---+---+-----+-----+

Now, if I need to get the price column into a row for each id based on day, I can use the pivot method:

pvtdf = (mydf
         .withColumn('combcol', F.concat(F.lit('price_'), mydf['day']))
         .groupby('id')
         .pivot('combcol')
         .agg(F.first('price')))
pvtdf.show()
# +---+-------+-------+-------+-------+
# | id|price_1|price_2|price_3|price_4|
# +---+-------+-------+-------+-------+
# |100|     23|     45|     67|     78|
# |101|     23|     45|     67|     78|
# |102|     23|     45|     67|     78|
# +---+-------+-------+-------+-------+

So when I need the units column transposed in the same way as price, I have to build one more dataframe as above for units and then join the two on "id". Since I may have more such columns, I wrote a function to do it:

def pivot_udf(df, *cols):
    mydf = df.select('id').drop_duplicates()
    for c in cols:
        mydf = mydf.join(
            df.withColumn('combcol', F.concat(F.lit('{}_'.format(c)), df['day']))
              .groupby('id').pivot('combcol').agg(F.first(c)),
            'id')
    return mydf

pivot_udf(mydf, 'price', 'units').show()
# +---+-------+-------+-------+-------+-------+-------+-------+-------+
# | id|price_1|price_2|price_3|price_4|units_1|units_2|units_3|units_4|
# +---+-------+-------+-------+-------+-------+-------+-------+-------+
# |100|     23|     45|     67|     78|     10|     11|     12|     13|
# |101|     23|     45|     67|     78|     10|     13|     14|     15|
# |102|     23|     45|     67|     78|     10|     11|     16|     18|
# +---+-------+-------+-------+-------+-------+-------+-------+-------+

Is it good practice to do it this way, and is there a better way of doing it?

Asked By: Suresh


Answers:

As of Spark 1.6, I think that's the only way, because pivot takes only one column. There is a second argument, values, to which you can pass the distinct values of that column; that makes your code run faster, because otherwise Spark has to compute them for you. So yes, that's the right way to do it.
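
For example, here is a minimal sketch of the price pivot from the question with the distinct values passed explicitly (the list price_1 ... price_4 is assumed from the sample data):

# Passing the pivot values up front avoids the extra job Spark would
# otherwise run to collect the distinct values of 'combcol'.
pvtdf = (mydf
         .withColumn('combcol', F.concat(F.lit('price_'), mydf['day']))
         .groupby('id')
         .pivot('combcol', ['price_1', 'price_2', 'price_3', 'price_4'])
         .agg(F.first('price')))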

Answered By: Ankit Kumar Namdeo

The solution in the question is the best I could get. The only improvement would be to cache the input dataset to avoid a double scan, i.e.

# Cache mydf so the per-column pivots inside pivot_udf reuse the
# in-memory data instead of rescanning the source.
mydf.cache()
pivot_udf(mydf, 'price', 'units').show()
Answered By: Jacek Laskowski

Here’s a non-UDF way involving a single pivot (hence, just a single column scan to identify all the unique day values).

dff = mydf.groupBy('id').pivot('day').agg(
    F.first('price').alias('price'),
    F.first('units').alias('unit')
)

Here’s the result (apologies for the non-matching ordering and naming):

+---+-------+------+-------+------+-------+------+-------+------+               
| id|1_price|1_unit|2_price|2_unit|3_price|3_unit|4_price|4_unit|
+---+-------+------+-------+------+-------+------+-------+------+
|100|     23|    10|     45|    11|     67|    12|     78|    13|
|101|     23|    10|     45|    13|     67|    14|     78|    15|
|102|     23|    10|     45|    11|     67|    16|     78|    18|
+---+-------+------+-------+------+-------+------+-------+------+

We simply aggregate both the price and the units column after pivoting on day.
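
If the day values are known in advance, the same pivot can take them as an explicit list, which saves Spark the extra pass that determines the distinct values. A sketch, assuming days 1 through 4 as in the sample data:

dff = (mydf.groupBy('id')
           .pivot('day', [1, 2, 3, 4])
           .agg(F.first('price').alias('price'),
                F.first('units').alias('unit')))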

If the naming from the question is required:

dff.select(
    [F.col(c).name('_'.join(c.split('_')[::-1])) for c in dff.columns]
).show()

+---+-------+------+-------+------+-------+------+-------+------+
| id|price_1|unit_1|price_2|unit_2|price_3|unit_3|price_4|unit_4|
+---+-------+------+-------+------+-------+------+-------+------+
|100|     23|    10|     45|    11|     67|    12|     78|    13|
|101|     23|    10|     45|    13|     67|    14|     78|    15|
|102|     23|    10|     45|    11|     67|    16|     78|    18|
+---+-------+------+-------+------+-------+------+-------+------+
Answered By: Jedi

This example shows how to group, pivot, and aggregate using multiple columns for each step.

It’s not obvious, but to pivot on multiple columns you first need to create one extra column that combines them and use that as the pivot column.

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('clsA', 'id1', 'a', 'x', 100, 15),
     ('clsA', 'id1', 'a', 'x', 110, 16),
     ('clsA', 'id1', 'a', 'y', 105, 14),
     ('clsA', 'id2', 'a', 'y', 110, 14),
     ('clsA', 'id1', 'b', 'y', 100, 13),
     ('clsA', 'id1', 'b', 'x', 120, 16),
     ('clsA', 'id2', 'b', 'y', 120, 17)],
    ['cls', 'id', 'grp1', 'grp2', 'price', 'units'])

Aggregation:

df = df.withColumn('_pivot', F.concat_ws('_', 'grp1', 'grp2'))
df = df.groupBy('cls', 'id').pivot('_pivot').agg(
    F.first('price').alias('price'),
    F.first('units').alias('unit')
)
df.show()
# +----+---+---------+--------+---------+--------+---------+--------+---------+--------+
# | cls| id|a_x_price|a_x_unit|a_y_price|a_y_unit|b_x_price|b_x_unit|b_y_price|b_y_unit|
# +----+---+---------+--------+---------+--------+---------+--------+---------+--------+
# |clsA|id2|     null|    null|      110|      14|     null|    null|      120|      17|
# |clsA|id1|      100|      15|      105|      14|      120|      16|      100|      13|
# +----+---+---------+--------+---------+--------+---------+--------+---------+--------+
Answered By: ZygD