Splitting each multi-category column into multiple columns with counts
Question:
date | Value1 | Value2 | Value3 |
---|---|---|---|
16-08-2022 | a | b | e |
16-08-2022 | a | b | f |
16-08-2022 | c | d | f |
output
date | Value1_a | Value1_c | Value2_b | Value2_d | Value3_e | Value3_f |
---|---|---|---|---|---|---|
16-08-2022 | 2 | 1 | 2 | 1 | 1 | 2 |
This continues for more columns, maybe 10. I want to aggregate on date and split each categorical column into counts per category. Currently I am doing it like this:
spark.sql('''select a.dateid, a.id,
    count(case when Value1 = "a" then 1 end) as value1_a,
    count(case when Value1 = "b" then 1 end) as value1_b
    from prep_data a
    group by a.dateid, a.id''').show()
I need a better way to do this; any Python/SQL solution is fine.
Answers:
You can create an array of structs for all the fields except the date field. Use inline() to explode that array into new columns, and then concatenate the field name and field value.
from pyspark.sql import functions as func

(data_sdf
 # build an array of (name, val) structs, one per non-date column
 .withColumn('val_struct_arr',
             func.array(*[func.struct(func.lit(k).alias('name'), func.col(k).alias('val'))
                          for k in data_sdf.columns if k != 'dt']))
 # inline() explodes the array into one row per (name, val) pair
 .selectExpr('dt', 'inline(val_struct_arr)')
 # concatenate field name and value, e.g. 'value1' + 'a' -> 'value1_a'
 .withColumn('name_val_concat', func.concat_ws('_', 'name', 'val'))
 .groupBy('dt')
 .pivot('name_val_concat')
 .count()
 .fillna(0)
 .show())
# +----------+--------+--------+--------+--------+--------+--------+
# | dt|value1_a|value1_c|value2_b|value2_d|value3_e|value3_f|
# +----------+--------+--------+--------+--------+--------+--------+
# |16-08-2022| 2| 1| 2| 1| 1| 2|
# +----------+--------+--------+--------+--------+--------+--------+
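For reference, here is a minimal sketch of how a data_sdf matching the example could be created for testing; the column names dt, value1, value2, value3 are assumptions taken from the snippet above (the second answer below uses date instead of dt):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical sample frame matching the question's example data
data_sdf = spark.createDataFrame(
    [('16-08-2022', 'a', 'b', 'e'),
     ('16-08-2022', 'a', 'b', 'f'),
     ('16-08-2022', 'c', 'd', 'f')],
    ['dt', 'value1', 'value2', 'value3'])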
from pyspark.sql import functions as F

df1 = (data_sdf
       .withColumn('tab', F.array(*[F.struct(F.concat_ws('_', F.lit(k), F.col(k)).alias('var'), F.col(k).alias('val'))
                                    for k in data_sdf.columns if k != 'date']))  # struct of column_value and value per column
       .selectExpr('date', 'inline(tab)')  # wide to long: one row per (var, val) pair
       .groupBy('date').pivot('var').agg(F.count('val')))  # pivot with counts
df1.show(truncate=False)
+----------+--------+--------+--------+--------+--------+--------+
|date |Value1_a|Value1_c|Value2_b|Value2_d|Value3_e|Value3_f|
+----------+--------+--------+--------+--------+--------+--------+
|16-08-2022|2 |1 |2 |1 |1 |2 |
+----------+--------+--------+--------+--------+--------+--------+
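If the distinct categories are known up front, they can be passed explicitly to pivot(), which skips the extra pass Spark otherwise makes to collect the distinct pivot values. A sketch under that assumption, where long_sdf is a stand-in name for the long-format frame produced by the selectExpr('date', 'inline(tab)') step above:
from pyspark.sql import functions as F

# hypothetical list of known name_value categories
known_vals = ['Value1_a', 'Value1_c', 'Value2_b', 'Value2_d', 'Value3_e', 'Value3_f']

result = (long_sdf                     # assumed: output of the wide-to-long step above
          .groupBy('date')
          .pivot('var', known_vals)    # explicit values avoid the distinct-value scan
          .agg(F.count('val'))
          .fillna(0))
result.show(truncate=False)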