Flatten Map Type in PySpark

Question:

I have a dataframe as below


+-------------+--------------+----+-------+-----------------------------------------------------------------------------------+
|empId        |organization  |h_cd|status |additional                                                                         |
+-------------+--------------+----+-------+-----------------------------------------------------------------------------------+
|FTE:56e662f  |CATENA        |0   |CURRENT|{hr_code -> 84534, bgc_val -> 170187, interviewPanel -> 6372, meetingId -> 3671}   |
|FTE:633e7bc  |Data Science  |0   |CURRENT|{hr_code -> 21036, bgc_val -> 170187, interviewPanel -> 764, meetingId -> 577}     |
|FTE:d9badd2  |CATENA        |0   |CURRENT|{hr_code -> 60696, bgc_val -> 88770}                                               |
+-------------+--------------+----+-------+-----------------------------------------------------------------------------------+


I want to flatten it and create a dataframe as below:

+-------------+--------------+----+-------+------------+------------+-------------------+---------------+
|empId        |organization  |h_cd|status |hr_code     |bgc_val     |interviewPanel     | meetingId     |
+-------------+--------------+----+-------+------------+------------+-------------------+---------------+
|FTE:56e662f  |CATENA        |0   |CURRENT|84534       |170187      |6372               |3671           |
|FTE:633e7bc  |Data Science  |0   |CURRENT|21036       |170187      |764                |577            |
|FTE:d9badd2  |CATENA        |0   |CURRENT|60696       |88770       |Null               |Null           |
+-------------+--------------+----+-------+------------+------------+-------------------+---------------+

My existing logic is as below

new_df = df.rdd.map(lambda x: (x.empId, x.h_cd, x.status, x.additional["hr_code"], x.additional["bgc_val"],
                               x.additional["interviewPanel"], x.additional["meetingId"])) \
              .toDF(["emp_id", "h_cd", "status", "hr_code", "bgc_val", "interview_panel", "meeting_id"])

However, when I use this logic to create new_df and try to write the dataframe, I run into the following error:

org.apache.spark.api.python.PythonException: 'KeyError: 'interviewPanel''  

This is because there is no interviewPanel key in the additional map-type field for empId FTE:d9badd2.

Can someone suggest the best way to handle this, so that null is added to the dataframe when a key is not present in the map-type field?
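
For reference, the only workaround I can think of is switching the lambda to dict.get() (a rough, untested sketch), since the map column comes back as a plain Python dict in the RDD rows and .get() returns None for absent keys, but I am hoping there is a cleaner DataFrame-API way:

# .get() returns None when the key is missing, so the row gets null instead of raising KeyError
new_df = df.rdd.map(lambda x: (x.empId, x.h_cd, x.status,
                               x.additional.get("hr_code"), x.additional.get("bgc_val"),
                               x.additional.get("interviewPanel"), x.additional.get("meetingId"))) \
              .toDF(["emp_id", "h_cd", "status", "hr_code", "bgc_val", "interview_panel", "meeting_id"])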

Thanks in advance!!

Asked By: bunnylorr


Answers:

Just use getItem on the column; it returns null when the key is not present in the map. E.g.

from pyspark.sql import functions as F

df.select("*", F.col("additional").getItem("meetingId").alias("meetingId"))

You can also collect the key names into a list to avoid hardcoding them (useful when there are many keys).

allKeys = df.select(F.explode('additional')).select(F.collect_set("key").alias("key")).first().asDict().get("key")

df.select("*", *[F.col("additional").getItem(key).alias(key) for key in allKeys]).show()
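
Put together, a sketch of the dynamic version (reusing the F alias imported above and the df from the question) might look like this; getItem gives null for rows where a key is missing, e.g. FTE:d9badd2:

# collect every key that appears in the 'additional' map across all rows
all_keys = df.select(F.explode("additional")).select(F.collect_set("key").alias("keys")).first()["keys"]

# getItem(key) returns null for rows where the key is absent
df.select("*", *[F.col("additional").getItem(k).alias(k) for k in all_keys]).show(truncate=False)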


Answered By: Ronak Jain

You just need to use the getField function on the map column:

from pyspark.sql.functions import col

df = spark.createDataFrame([("FTE:56e662f", "CATENA", 0, "CURRENT",
                             {"hr_code": 84534, "bgc_val": 170187, "interviewPanel": 6372, "meetingId": 3671}),
                            ("FTE:633e7bc", "Data Science", 0, "CURRENT",
                             {"hr_code": 21036, "bgc_val": 170187, "interviewPanel": 764, "meetingId": 577}),
                            ("FTE:d9badd2", "CATENA", 0, "CURRENT",
                             {"hr_code": 60696, "bgc_val": 88770})],
                           ["empId", "organization", "h_cd", "status", "additional"])

df.select("empId", "organization", "h_cd", "status",
          col("additional").getField("hr_code").alias("hr_code"),
          col("additional").getField("bgc_val").alias("bgc_val"),
          col("additional").getField("interviewPanel").alias("interviewPanel"),
          col("additional").getField("meetingId").alias("meetingId")
          ).show(truncate=False)

+-----------+------------+----+-------+-------+-------+--------------+---------+
|empId      |organization|h_cd|status |hr_code|bgc_val|interviewPanel|meetingId|
+-----------+------------+----+-------+-------+-------+--------------+---------+
|FTE:56e662f|CATENA      |0   |CURRENT|84534  |170187 |6372          |3671     |
|FTE:633e7bc|Data Science|0   |CURRENT|21036  |170187 |764           |577      |
|FTE:d9badd2|CATENA      |0   |CURRENT|60696  |88770  |null          |null     |
+-----------+------------+----+-------+-------+-------+--------------+---------+

Answered By: Mohana B C

This is an Apache Spark Scala answer; you can translate it to PySpark as well.

AFAIK you need to explode, group, and pivot by key, as in the example below:

import org.apache.spark.sql.functions._
import spark.implicits._  // not needed in spark-shell, where it is imported automatically

val df = Seq(
  ("FTE:56e662f", "CATENA", 0, "CURRENT", Map("hr_code" -> 84534, "bgc_val" -> 170187, "interviewPanel" -> 6372, "meetingId" -> 3671)),
  ("FTE:633e7bc", "Data Science", 0, "CURRENT", Map("hr_code" -> 21036, "bgc_val" -> 170187, "interviewPanel" -> 764, "meetingId" -> 577)),
  ("FTE:d9badd2", "CATENA", 0, "CURRENT", Map("hr_code" -> 60696, "bgc_val" -> 88770))
).toDF("empId", "organization", "h_cd", "status", "additional")

val explodeddf = df.select($"empId", $"organization", $"h_cd", $"status", explode($"additional"))
val grpdf = explodeddf.groupBy($"empId", $"organization", $"h_cd", $"status").pivot("key").agg(first("value"))
val finaldf = grpdf.selectExpr("empId", "organization", "h_cd", "status", "hr_code", "bgc_val", "interviewPanel", "meetingId")
finaldf.show

Output:

+-----------+------------+----+-------+-------+-------+--------------+---------+
|      empId|organization|h_cd| status|hr_code|bgc_val|interviewPanel|meetingId|
+-----------+------------+----+-------+-------+-------+--------------+---------+
|FTE:633e7bc|Data Science|   0|CURRENT|  21036| 170187|           764|      577|
|FTE:d9badd2|      CATENA|   0|CURRENT|  60696|  88770|          null|     null|
|FTE:56e662f|      CATENA|   0|CURRENT|  84534| 170187|          6372|     3671|
+-----------+------------+----+-------+-------+-------+--------------+---------+
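
A rough PySpark translation of the same explode / groupBy / pivot approach (an untested sketch, assuming the df from the question) might look like:

from pyspark.sql import functions as F

# explode the map into (key, value) rows, then pivot the keys back into columns
exploded_df = df.select("empId", "organization", "h_cd", "status", F.explode("additional"))
grp_df = exploded_df.groupBy("empId", "organization", "h_cd", "status").pivot("key").agg(F.first("value"))
final_df = grp_df.select("empId", "organization", "h_cd", "status", "hr_code", "bgc_val", "interviewPanel", "meetingId")
final_df.show(truncate=False)
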
Answered By: Ram Ghadiyaram

You can use .* to expand a struct column into its field columns:

df.select("empId", "organization", "h_cd", "status", "additional.*")

Answered By: Abdennacer Lachiheb