How to make sure values map to the right Delta table columns?
Question:
I’m writing a PySpark job to read the Values column from table1. Table1 has two columns -> ID, Values.
Sample data in the Values column:
+----+-----------------------------------+
| ID | values                            |
+----+-----------------------------------+
| 1  | a=10&b=2&c=13&e=55&d=78&j=98&l=99 |
| 2  | l=22&e=67&j=34&a=7&c=9&d=77&b=66  |
+----+-----------------------------------+
I have to read the values column from a delta table and split it. Then I have to store it in another delta table as depicted below:
+----+----+----+----+----+----+----+----+
| ID | a  | b  | c  | d  | e  | j  | l  |
+----+----+----+----+----+----+----+----+
| 1  | 10 | 2  | 13 | 78 | 55 | 98 | 99 |
| 2  | 7  | 66 | 9  | 77 | 67 | 34 | 22 |
+----+----+----+----+----+----+----+----+
Any suggestions on how to resolve this would be helpful.
Answers:
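Both answers assume the source Delta table has already been loaded into a DataFrame df; a minimal way to do that (the table name is taken from the question, the path variant is a placeholder):

# load the source Delta table by name
df = spark.read.table("table1")
# or, for a path-based Delta table (placeholder path):
# df = spark.read.format("delta").load("/path/to/table1")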
You can do the following:
from pyspark.sql import functions as F

(
    df
    # split on "&" and explode into one "key=value" row per pair
    .withColumn("values", F.explode(F.split(F.col("values"), "&")))
    # extract the leading letters as the key
    .withColumn("tag", F.regexp_extract(F.col("values"), "^[a-z]+", 0))
    # strip the "key=" prefix so only the value remains
    .withColumn("values", F.regexp_replace(F.col("values"), "^[a-z]+=", ""))
    # pivot the keys into columns, one row per ID
    .groupby("ID")
    .pivot("tag")
    .agg(F.first(F.col("values")))
    .show()
)
Output:
+---+---+---+---+---+---+---+---+
| ID|  a|  b|  c|  d|  e|  j|  l|
+---+---+---+---+---+---+---+---+
|  1| 10|  2| 13| 78| 55| 98| 99|
|  2|  7| 66|  9| 77| 67| 34| 22|
+---+---+---+---+---+---+---+---+
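To actually store the result in another Delta table, as the question asks, drop the trailing .show() and bind the DataFrame to a variable instead. A minimal sketch, assuming the chain above is assigned to pivoted; note that "table2" is a placeholder name and that the pivoted values are still strings, so they are cast to int first:

# cast every pivoted column (everything except ID) from string to int
result = pivoted.select(
    "ID", *[F.col(c).cast("int").alias(c) for c in pivoted.columns if c != "ID"]
)
# write to the target Delta table ("table2" is a placeholder)
result.write.format("delta").mode("overwrite").saveAsTable("table2")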
You can convert the values column to map type by using the transform function after splitting. After the conversion, select all the keys from the map.
from pyspark.sql.functions import (expr, aggregate, create_map, map_concat,
                                   explode_outer, collect_set, col)

df = spark.createDataFrame([(1, "a=10&b=2&c=13&e=55&d=78&j=98&l=99"),
                            (2, "l=22&e=67&j=34&a=7&c=9&d=77&b=66")],
                           ["ID", "values"])

transformed_df = (
    df
    # turn each "key=value" pair into a single-entry map<string, int>
    .withColumn("values", expr("transform(split(values, '&'), "
                               "c -> map(split(c, '=')[0], cast(split(c, '=')[1] as int)))"))
    # fold the array of single-entry maps into one map per row
    .withColumn("values", aggregate("values", create_map().cast("map<string,int>"),
                                    lambda acc, m: map_concat(acc, m)))
)

# if the set of keys is fixed
keys = ['a', 'b', 'c', 'd', 'e', 'j', 'l']
# else, to avoid hardcoding, collect the distinct keys from the data
keys = sorted(transformed_df.select(explode_outer("values"))
              .select(collect_set("key").alias("key"))
              .first().asDict().get("key"))

transformed_df.select("ID", *[col("values").getItem(k).alias(k) for k in keys]).show()
+---+---+---+---+---+---+---+---+
| ID|  a|  b|  c|  d|  e|  j|  l|
+---+---+---+---+---+---+---+---+
|  1| 10|  2| 13| 78| 55| 98| 99|
|  2|  7| 66|  9| 77| 67| 34| 22|
+---+---+---+---+---+---+---+---+
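As a side note, the split-and-fold steps can be collapsed into one expression with Spark SQL's built-in str_to_map function. A minimal sketch of the same idea, reusing df and keys from above (str_to_map keeps the values as strings, hence the cast):

from pyspark.sql.functions import expr, col

# parse "a=10&b=2&..." directly into a map<string,string>
mapped = df.withColumn("values", expr("str_to_map(values, '&', '=')"))
mapped.select("ID", *[col("values")[k].cast("int").alias(k) for k in keys]).show()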