Splitting each multi-category column into multiple columns with counts
Question:
date | Value1 | Value2 | Value3 |
---|---|---|---|
16-08-2022 | a | b | e |
16-08-2022 | a | b | f |
16-08-2022 | c | d | f |
output
date | Value1_a | Value1_c | Value2_b | Value2_d | Value3_e | Value3_f |
---|---|---|---|---|---|---|
16-08-2022 | 2 | 1 | 2 | 1 | 1 | 2 |
This continues for more columns, maybe 10. I want to aggregate on date and split each categorical column into counts per category. Currently I am doing it like this:
spark.sql('''select a.dateid, a.id,
    count(case when Value1 = "a" then 1 end) as value1_a,
    count(case when Value1 = "b" then 1 end) as value1_b
    from prep_data a
    group by a.dateid, a.id''').show()
I need a better way to do this; any Python/SQL solution is fine.
Answers:
You can create an array of structs for all the fields except the date field. Use inline() to explode that array into new columns, and then concatenate the field name and field value.
from pyspark.sql import functions as func

(data_sdf
 # build an array of (name, val) structs, one per non-date column
 .withColumn('val_struct_arr',
             func.array(*[func.struct(func.lit(k).alias('name'), func.col(k).alias('val'))
                          for k in data_sdf.columns if k != 'dt']))
 # inline() explodes the array into one row per (name, val) pair
 .selectExpr('dt', 'inline(val_struct_arr)')
 # concatenate field name and value, e.g. 'value1' + 'a' -> 'value1_a'
 .withColumn('name_val_concat', func.concat_ws('_', 'name', 'val'))
 .groupBy('dt')
 .pivot('name_val_concat')
 .count()
 .fillna(0)
 .show())
# +----------+--------+--------+--------+--------+--------+--------+
# | dt|value1_a|value1_c|value2_b|value2_d|value3_e|value3_f|
# +----------+--------+--------+--------+--------+--------+--------+
# |16-08-2022| 2| 1| 2| 1| 1| 2|
# +----------+--------+--------+--------+--------+--------+--------+
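For reference, here is a minimal sketch of how a data_sdf matching the example could be created for testing; the column names dt, value1, value2, value3 are assumptions taken from the snippet above (the second answer below uses date instead of dt):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical sample frame matching the question's example data
data_sdf = spark.createDataFrame(
    [('16-08-2022', 'a', 'b', 'e'),
     ('16-08-2022', 'a', 'b', 'f'),
     ('16-08-2022', 'c', 'd', 'f')],
    ['dt', 'value1', 'value2', 'value3'])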
from pyspark.sql import functions as F

df1 = (data_sdf
       .withColumn('tab', F.array(*[F.struct(F.concat_ws('_', F.lit(k), F.col(k)).alias('var'), F.col(k).alias('val'))
                                    for k in data_sdf.columns if k != 'date']))  # struct of column_value and value per column
       .selectExpr('date', 'inline(tab)')  # wide to long: one row per (var, val) pair
       .groupBy('date').pivot('var').agg(F.count('val')))  # pivot with counts
df1.show(truncate=False)
+----------+--------+--------+--------+--------+--------+--------+
|date |Value1_a|Value1_c|Value2_b|Value2_d|Value3_e|Value3_f|
+----------+--------+--------+--------+--------+--------+--------+
|16-08-2022|2 |1 |2 |1 |1 |2 |
+----------+--------+--------+--------+--------+--------+--------+
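If the distinct categories are known up front, they can be passed explicitly to pivot(), which skips the extra pass Spark otherwise makes to collect the distinct pivot values. A sketch under that assumption, where long_sdf is a stand-in name for the long-format frame produced by the selectExpr('date', 'inline(tab)') step above:
from pyspark.sql import functions as F

# hypothetical list of known name_value categories
known_vals = ['Value1_a', 'Value1_c', 'Value2_b', 'Value2_d', 'Value3_e', 'Value3_f']

result = (long_sdf                     # assumed: output of the wide-to-long step above
          .groupBy('date')
          .pivot('var', known_vals)    # explicit values avoid the distinct-value scan
          .agg(F.count('val'))
          .fillna(0))
result.show(truncate=False)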