How to pivot on multiple columns in Spark SQL?
Question:
I need to pivot more than one column in a PySpark dataframe. Sample dataframe:
from pyspark.sql import functions as F
d = [(100,1,23,10),(100,2,45,11),(100,3,67,12),(100,4,78,13),(101,1,23,10),(101,2,45,13),(101,3,67,14),(101,4,78,15),(102,1,23,10),(102,2,45,11),(102,3,67,16),(102,4,78,18)]
mydf = spark.createDataFrame(d,['id','day','price','units'])
mydf.show()
# +---+---+-----+-----+
# | id|day|price|units|
# +---+---+-----+-----+
# |100| 1| 23| 10|
# |100| 2| 45| 11|
# |100| 3| 67| 12|
# |100| 4| 78| 13|
# |101| 1| 23| 10|
# |101| 2| 45| 13|
# |101| 3| 67| 14|
# |101| 4| 78| 15|
# |102| 1| 23| 10|
# |102| 2| 45| 11|
# |102| 3| 67| 16|
# |102| 4| 78| 18|
# +---+---+-----+-----+
Now, if I need to get the price column into a row for each id based on day, I can use the pivot method:
pvtdf = mydf.withColumn('combcol', F.concat(F.lit('price_'), mydf['day'])) \
    .groupby('id').pivot('combcol').agg(F.first('price'))
pvtdf.show()
# +---+-------+-------+-------+-------+
# | id|price_1|price_2|price_3|price_4|
# +---+-------+-------+-------+-------+
# |100| 23| 45| 67| 78|
# |101| 23| 45| 67| 78|
# |102| 23| 45| 67| 78|
# +---+-------+-------+-------+-------+
So when I need the units column transposed in the same way as price, I have to create one more dataframe as above for units and then join both using "id". But when I have many such columns, I wrote a function to do it:
def pivot_udf(df, *cols):
    mydf = df.select('id').drop_duplicates()
    for c in cols:
        mydf = mydf.join(
            df.withColumn('combcol', F.concat(F.lit('{}_'.format(c)), df['day']))
              .groupby('id').pivot('combcol').agg(F.first(c)),
            'id')
    return mydf
pivot_udf(mydf, 'price', 'units').show()
# +---+-------+-------+-------+-------+-------+-------+-------+-------+
# | id|price_1|price_2|price_3|price_4|units_1|units_2|units_3|units_4|
# +---+-------+-------+-------+-------+-------+-------+-------+-------+
# |100| 23| 45| 67| 78| 10| 11| 12| 13|
# |101| 23| 45| 67| 78| 10| 13| 14| 15|
# |102| 23| 45| 67| 78| 10| 11| 16| 18|
# +---+-------+-------+-------+-------+-------+-------+-------+-------+
Is it a good practice to do so and is there any other better way of doing it?
Answers:
As of Spark 1.6, I think that's the only way, because pivot takes only one column. It does have a second argument, values, to which you can pass the distinct values of that column; doing so makes your code run faster, because otherwise Spark has to compute those values for you. So yes, that's the right way to do it.
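For illustration, here is a plain-Python sketch of precomputing those distinct pivot values so they could be passed as the second argument to pivot (the days 1-4 are an assumption taken from the sample data, not something Spark computes here):

```python
# Hypothetical sketch: build the distinct pivot values up front so they can be
# passed to pivot() as its second argument, avoiding the extra job Spark would
# otherwise run to discover them. The days 1-4 come from the sample data above.
days = [1, 2, 3, 4]
price_values = ['price_{}'.format(d) for d in days]
print(price_values)  # ['price_1', 'price_2', 'price_3', 'price_4']

# In Spark this list would be used roughly as:
# mydf.withColumn('combcol', F.concat(F.lit('price_'), mydf['day'])) \
#     .groupby('id').pivot('combcol', price_values).agg(F.first('price'))
```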
The solution in the question is the best I could get. The only improvement would be to cache the input dataset to avoid the double scan, i.e.
mydf.cache()
pivot_udf(mydf, 'price', 'units').show()
Here’s a non-UDF way involving a single pivot (hence, just a single column scan to identify all the unique days).
dff = mydf.groupBy('id').pivot('day').agg(
    F.first('price').alias('price'),
    F.first('units').alias('unit'))
Here’s the result (apologies for the non-matching ordering and naming):
+---+-------+------+-------+------+-------+------+-------+------+
| id|1_price|1_unit|2_price|2_unit|3_price|3_unit|4_price|4_unit|
+---+-------+------+-------+------+-------+------+-------+------+
|100| 23| 10| 45| 11| 67| 12| 78| 13|
|101| 23| 10| 45| 13| 67| 14| 78| 15|
|102| 23| 10| 45| 11| 67| 16| 78| 18|
+---+-------+------+-------+------+-------+------+-------+------+
We just aggregate both the price and the unit columns after pivoting on the day. If the naming is required to match the question:
dff.select([F.col(c).name('_'.join(x for x in c.split('_')[::-1])) for c in dff.columns]).show()
+---+-------+------+-------+------+-------+------+-------+------+
| id|price_1|unit_1|price_2|unit_2|price_3|unit_3|price_4|unit_4|
+---+-------+------+-------+------+-------+------+-------+------+
|100| 23| 10| 45| 11| 67| 12| 78| 13|
|101| 23| 10| 45| 13| 67| 14| 78| 15|
|102| 23| 10| 45| 11| 67| 16| 78| 18|
+---+-------+------+-------+------+-------+------+-------+------+
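The renaming expression above just splits each column name on '_' and reverses the parts. As a plain-Python sketch of the same transformation (using a few of the column names from the result, for illustration):

```python
# Each column name is split on '_' and the parts are reversed, so '1_price'
# becomes 'price_1'. 'id' contains no underscore, so it passes through
# unchanged.
cols = ['id', '1_price', '1_unit', '2_price', '2_unit']
renamed = ['_'.join(c.split('_')[::-1]) for c in cols]
print(renamed)  # ['id', 'price_1', 'unit_1', 'price_2', 'unit_2']
```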
This is the example showing how to group, pivot and aggregate using multiple columns for each.
It’s not obvious that, when pivoting on multiple columns, you first need to create one extra column that is then used for pivoting.
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('clsA', 'id1', 'a', 'x', 100, 15),
     ('clsA', 'id1', 'a', 'x', 110, 16),
     ('clsA', 'id1', 'a', 'y', 105, 14),
     ('clsA', 'id2', 'a', 'y', 110, 14),
     ('clsA', 'id1', 'b', 'y', 100, 13),
     ('clsA', 'id1', 'b', 'x', 120, 16),
     ('clsA', 'id2', 'b', 'y', 120, 17)],
    ['cls', 'id', 'grp1', 'grp2', 'price', 'units'])
Aggregation:
df = df.withColumn('_pivot', F.concat_ws('_', 'grp1', 'grp2'))
df = df.groupBy('cls', 'id').pivot('_pivot').agg(
    F.first('price').alias('price'),
    F.first('units').alias('unit')
)
df.show()
# +----+---+---------+--------+---------+--------+---------+--------+---------+--------+
# | cls| id|a_x_price|a_x_unit|a_y_price|a_y_unit|b_x_price|b_x_unit|b_y_price|b_y_unit|
# +----+---+---------+--------+---------+--------+---------+--------+---------+--------+
# |clsA|id2| null| null| 110| 14| null| null| 120| 17|
# |clsA|id1| 100| 15| 105| 14| 120| 16| 100| 13|
# +----+---+---------+--------+---------+--------+---------+--------+---------+--------+
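As a plain-Python sketch of how the pivoted column names above arise: concat_ws('_', 'grp1', 'grp2') joins the two group values with '_', and Spark then appends each aggregation alias to the pivot value:

```python
# Reproduce the column-name construction in plain Python. The group pairs are
# the distinct (grp1, grp2) combinations from the sample data above.
group_pairs = [('a', 'x'), ('a', 'y'), ('b', 'x'), ('b', 'y')]
pivot_keys = ['_'.join(pair) for pair in group_pairs]   # what concat_ws does
col_names = ['{}_{}'.format(k, m) for k in pivot_keys for m in ('price', 'unit')]
print(col_names)
# ['a_x_price', 'a_x_unit', 'a_y_price', 'a_y_unit',
#  'b_x_price', 'b_x_unit', 'b_y_price', 'b_y_unit']
```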