How to make sure values are mapped to the right Delta table columns?

Question:

I’m writing a PySpark job to read the values column from table1, which has two columns: ID and values.

Sample data in the Values column:

+----+-----------------------------------+
| ID |               values              |
+----+-----------------------------------+
| 1  | a=10&b=2&c=13&e=55&d=78&j=98&l=99 |
| 2  | l=22&e=67&j=34&a=7&c=9&d=77&b=66  |
+----+-----------------------------------+

I have to read the values column from a Delta table and split it, then store the result in another Delta table as depicted below:

+----+----+----+----+----+----+----+----+
| ID | a  | b  | c  | d  | e  | j  | l  |
+----+----+----+----+----+----+----+----+
| 1  | 10 | 2  | 13 | 78 | 55 | 98 | 99 |
| 2  | 7  | 66 | 9  | 77 | 67 | 34 | 22 |
+----+----+----+----+----+----+----+----+

Any suggestion to resolve this would be helpful.

Asked By: curious-coder


Answers:

You can do the following:

from pyspark.sql import functions as F

# df is the source DataFrame with columns ID and values
result = (
    df
    # split "a=10&b=2&..." on "&" and explode to one key=value pair per row
    .withColumn("values", F.explode(F.split(F.col("values"), "&")))
    # extract the key (the letters before "=") into its own column
    .withColumn("tag", F.regexp_extract(F.col("values"), "^[a-z]+", 0))
    # strip the "key=" prefix, leaving just the value
    .withColumn("values", F.regexp_replace(F.col("values"), "^[a-z]+=", ""))
    # pivot the keys into columns, one row per ID
    .groupBy("ID")
    .pivot("tag")
    .agg(F.first(F.col("values")))
)
result.show()

Output:

+---+---+---+---+---+---+---+---+
| ID|  a|  b|  c|  d|  e|  j|  l|
+---+---+---+---+---+---+---+---+
|  1| 10|  2| 13| 78| 55| 98| 99|
|  2|  7| 66|  9| 77| 67| 34| 22|
+---+---+---+---+---+---+---+---+
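
The question also asks to store the result in another Delta table; a minimal sketch for that last step, assuming a hypothetical target table name target_table:

# Persist the pivoted result as a Delta table.
# "target_table" is a hypothetical name (assumption), not from the question.
result.write.format("delta").mode("overwrite").saveAsTable("target_table")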
Answered By: langtang

You can convert the values column to a map type by using the transform function after splitting, then select all keys from the map.

from pyspark.sql.functions import (aggregate, col, collect_set, create_map,
                                   expr, explode_outer, map_concat)

df = spark.createDataFrame([(1, "a=10&b=2&c=13&e=55&d=78&j=98&l=99"),
                            (2, "l=22&e=67&j=34&a=7&c=9&d=77&b=66")],
                           ["ID", "values"])

transformed_df = (
    df
    # turn each "key=value" pair into a single-entry map<string,int>
    .withColumn("values",
                expr("transform(split(values, '&'), "
                     "c -> map(split(c, '=')[0], cast(split(c, '=')[1] as int)))"))
    # merge the array of single-entry maps into one map per row
    .withColumn("values",
                aggregate("values", create_map().cast("map<string,int>"),
                          lambda acc, m: map_concat(acc, m)))
)

# if the keys are fixed, hardcode them
keys = ['a', 'b', 'c', 'd', 'e', 'j', 'l']
# otherwise, derive them from the data to avoid hardcoded values
keys = sorted(transformed_df.select(explode_outer('values'))
              .select(collect_set("key").alias("key")).first().asDict().get("key"))

transformed_df.select("ID", *[col("values").getItem(k).alias(k) for k in keys]).show()

+---+---+---+---+---+---+---+---+
| ID|  a|  b|  c|  d|  e|  j|  l|
+---+---+---+---+---+---+---+---+
|  1| 10|  2| 13| 78| 55| 98| 99|
|  2|  7| 66|  9| 77| 67| 34| 22|
+---+---+---+---+---+---+---+---+
Answered By: Mohana B C
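
For completeness, Spark's built-in str_to_map function (which neither answer uses) can replace the manual transform/aggregate combination, and the source can be read straight from the Delta table named in the question. A sketch, assuming table1 is registered in the metastore and table2 is a hypothetical target name:

from pyspark.sql.functions import col, expr

# Read the source Delta table (table1, as named in the question).
df = spark.read.table("table1")

# str_to_map splits on '&' into pairs and on '=' into key/value,
# producing a map<string,string> in one step.
parsed = df.withColumn("values", expr("str_to_map(values, '&', '=')"))

keys = ['a', 'b', 'c', 'd', 'e', 'j', 'l']
result = parsed.select("ID",
                       *[col("values")[k].cast("int").alias(k) for k in keys])

# "table2" is a hypothetical target name (assumption).
result.write.format("delta").mode("overwrite").saveAsTable("table2")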