Flatten Map Type in PySpark
Question:
I have a dataframe as below
+-------------+--------------+----+-------+-----------------------------------------------------------------------------------+
|empId |organization |h_cd|status |additional |
+-------------+--------------+----+-------+-----------------------------------------------------------------------------------+
|FTE:56e662f |CATENA |0 |CURRENT|{hr_code -> 84534, bgc_val -> 170187, interviewPanel -> 6372, meetingId -> 3671} |
|FTE:633e7bc |Data Science |0 |CURRENT|{hr_code -> 21036, bgc_val -> 170187, interviewPanel -> 764, meetingId -> 577} |
|FTE:d9badd2 |CATENA |0 |CURRENT|{hr_code -> 60696, bgc_val -> 88770} |
+-------------+--------------+----+-------+-----------------------------------------------------------------------------------+
I want to flatten it and create a dataframe as below:
+-------------+--------------+----+-------+------------+------------+-------------------+---------------+
|empId |organization |h_cd|status |hr_code |bgc_val |interviewPanel | meetingId |
+-------------+--------------+----+-------+------------+------------+-------------------+---------------+
|FTE:56e662f |CATENA |0 |CURRENT|84534 |170187 |6372 |3671 |
|FTE:633e7bc |Data Science |0 |CURRENT|21036 |170187 |764 |577 |
|FTE:d9badd2 |CATENA |0 |CURRENT|60696 |88770 |Null |Null |
+-------------+--------------+----+-------+------------+------------+-------------------+---------------+
My existing logic is as below:
new_df = df.rdd.map(lambda x: (x.empId, x.organization, x.h_cd, x.status,
                               x.additional["hr_code"], x.additional["bgc_val"],
                               x.additional["interviewPanel"], x.additional["meetingId"])) \
    .toDF(["emp_id", "organization", "h_cd", "status", "hr_code", "bgc_val", "interview_panel", "meeting_id"])
However, when I use this logic to create new_df and try to write the dataframe, I run into the error
org.apache.spark.api.python.PythonException: 'KeyError: 'interviewPanel''
This is caused by the fact that there is no additional.interviewPanel key in the map-type field for empId FTE:d9badd2.
Can someone suggest the best way to handle this, so that null is added to the dataframe when a key:val pair is not present in the map-type field?
Thanks in advance!!
Answers:
Just use getItem on the column. E.g.
df.select("*", F.col("additional").getItem("meetingId").alias("meetingId"))
You can also collect the key names in a list to avoid hardcoded values (useful when there are many keys).
allKeys = df.select(F.explode('additional')).select(F.collect_set("key").alias("key")).first().asDict().get("key")
df.select("*", *[F.col("additional").getItem(key).alias(key) for key in allKeys]).show()
You just need to use the getField function on the map column:
from pyspark.sql.functions import col

df = spark.createDataFrame([
    ("FTE:56e662f", "CATENA", 0, "CURRENT",
     {"hr_code": 84534, "bgc_val": 170187, "interviewPanel": 6372, "meetingId": 3671}),
    ("FTE:633e7bc", "Data Science", 0, "CURRENT",
     {"hr_code": 21036, "bgc_val": 170187, "interviewPanel": 764, "meetingId": 577}),
    ("FTE:d9badd2", "CATENA", 0, "CURRENT",
     {"hr_code": 60696, "bgc_val": 88770}),
], ["empId", "organization", "h_cd", "status", "additional"])

df.select("empId", "organization", "h_cd", "status",
          col("additional").getField("hr_code").alias("hr_code"),
          col("additional").getField("bgc_val").alias("bgc_val"),
          col("additional").getField("interviewPanel").alias("interviewPanel"),
          col("additional").getField("meetingId").alias("meetingId")
          ).show(truncate=False)
+-----------+------------+----+-------+-------+-------+--------------+---------+
|empId |organization|h_cd|status |hr_code|bgc_val|interviewPanel|meetingId|
+-----------+------------+----+-------+-------+-------+--------------+---------+
|FTE:56e662f|CATENA |0 |CURRENT|84534 |170187 |6372 |3671 |
|FTE:633e7bc|Data Science|0 |CURRENT|21036 |170187 |764 |577 |
|FTE:d9badd2|CATENA |0 |CURRENT|60696 |88770 |null |null |
+-----------+------------+----+-------+-------+-------+--------------+---------+
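If you prefer not to hardcode the key list, a sketch of deriving it with map_keys (assuming Spark 2.4+, where pyspark.sql.functions.map_keys is available):

from pyspark.sql import functions as F

# Collect the distinct keys that appear anywhere in the map column
keys = (df.select(F.explode(F.map_keys("additional")).alias("key"))
          .distinct()
          .rdd.map(lambda r: r.key)
          .collect())
df.select("empId", "organization", "h_cd", "status",
          *[F.col("additional").getItem(k).alias(k) for k in keys]).show(truncate=False)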
Here is an Apache Spark Scala answer that you can translate to PySpark as well.
AFAIK you need to explode, group, and pivot by key, as in the example below:
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF and the $ column syntax

val df = Seq(
  ("FTE:56e662f", "CATENA", 0, "CURRENT", Map("hr_code" -> 84534, "bgc_val" -> 170187, "interviewPanel" -> 6372, "meetingId" -> 3671)),
  ("FTE:633e7bc", "Data Science", 0, "CURRENT", Map("hr_code" -> 21036, "bgc_val" -> 170187, "interviewPanel" -> 764, "meetingId" -> 577)),
  ("FTE:d9badd2", "CATENA", 0, "CURRENT", Map("hr_code" -> 60696, "bgc_val" -> 88770))
).toDF("empId", "organization", "h_cd", "status", "additional")

val explodeddf = df.select($"empId", $"organization", $"h_cd", $"status", explode($"additional"))
val grpdf = explodeddf.groupBy($"empId", $"organization", $"h_cd", $"status").pivot("key").agg(first("value"))
val finaldf = grpdf.selectExpr("empId", "organization", "h_cd", "status", "hr_code", "bgc_val", "interviewPanel", "meetingId")
finaldf.show
Output:
+-----------+------------+----+-------+-------+-------+--------------+---------+
|      empId|organization|h_cd| status|hr_code|bgc_val|interviewPanel|meetingId|
+-----------+------------+----+-------+-------+-------+--------------+---------+
|FTE:633e7bc|Data Science|   0|CURRENT|  21036| 170187|           764|      577|
|FTE:d9badd2|      CATENA|   0|CURRENT|  60696|  88770|          null|     null|
|FTE:56e662f|      CATENA|   0|CURRENT|  84534| 170187|          6372|     3671|
+-----------+------------+----+-------+-------+-------+--------------+---------+
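For reference, a sketch of the same explode/group/pivot approach translated to PySpark (assuming the df built in the previous answer):

from pyspark.sql import functions as F

exploded = df.select("empId", "organization", "h_cd", "status", F.explode("additional"))
pivoted = (exploded.groupBy("empId", "organization", "h_cd", "status")
                   .pivot("key")            # one column per distinct key
                   .agg(F.first("value")))  # missing keys come out as null
pivoted.selectExpr("empId", "organization", "h_cd", "status",
                   "hr_code", "bgc_val", "interviewPanel", "meetingId").show()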
You can use .* to expand a struct column into one column per field:
df.select("empId", "organization", "h_cd", "status", "additional.*")