Pyspark stack to fill existing column

Question:

I have a data frame like below

col1, col2 … coln, key, value

I got the key, value after exploding a col with format map<string, string>

Now I want to stack col2 to coln and insert them in the key, value columns. I want the previous values in key, value cols to remain as it is.

How to achieve this? One solution I can think of is pivot the key, value column and then create the stack expression. Anything better and faster than this?

Edit –
Input
input

Output

Output

Asked By: Rahul Kumar

||

Answers:

You can add the new elements within the map column before exploding it. It will be easier to process.

Here’s an example using a sample data similar to your sample’s structure.

data_sdf.show()

# +---+---+---+---+------------------------+
# |id |c1 |c2 |c3 |c4                      |
# +---+---+---+---+------------------------+
# |abc|1  |2  |3  |{key_b -> 5, key_a -> 4}|
# +---+---+---+---+------------------------+

col_list = ['c1', 'c2', 'c3']

data_sdf. 
    withColumn('col2structarr', 
               func.array(*[func.struct(func.lit(k).alias('key'), func.col(k).alias('value'))
                            for k in col_list]
                          )
               ). 
    withColumn('map2structarr', func.map_entries('c4')). 
    withColumn('structarrs_merged', func.array_union('map2structarr', 'col2structarr')). 
    withColumn('mergedstructarr2map', func.map_from_entries('structarrs_merged')). 
    selectExpr('id', 'explode(mergedstructarr2map)'). 
    show(truncate=False)

# +---+-----+-----+
# |id |key  |value|
# +---+-----+-----+
# |abc|key_b|5    |
# |abc|key_a|4    |
# |abc|c1   |1    |
# |abc|c2   |2    |
# |abc|c3   |3    |
# +---+-----+-----+
  • a map column is like a dictionary and has key-value pairs. So, when it is converted to array of structs, it generates 2 fields within the structs – key, value.
  • map2structarr – convert the map column to array of structs, so that we can use array_union for the new element addition.
  • col2structarr – using the aforementioned info on map columns and array of structs, we can create an array of structs for the columns that we need to add in the map column.
  • structarrs_merged – merge the 2 arrays created above (1 from map column, and 1 from the columns we want to add) using array_join. The function requires the arrays to be of same data type.
  • mergedstructarr2map – an array of strut can be converted to map type column using map_from_entries.
  • use explode to create key and value columns from the final map type column.

Here is a snapshot of all the individual columns created above (before the explode)

+---+---+---+---+------------------------+---------------------------+------------------------+---------------------------------------------------+---------------------------------------------------+
|id |c1 |c2 |c3 |c4                      |col2structarr              |map2structarr           |structarrs_merged                                  |mergedstructarr2map                                |
+---+---+---+---+------------------------+---------------------------+------------------------+---------------------------------------------------+---------------------------------------------------+
|abc|1  |2  |3  |{key_b -> 5, key_a -> 4}|[{c1, 1}, {c2, 2}, {c3, 3}]|[{key_b, 5}, {key_a, 4}]|[{key_b, 5}, {key_a, 4}, {c1, 1}, {c2, 2}, {c3, 3}]|{key_b -> 5, key_a -> 4, c1 -> 1, c2 -> 2, c3 -> 3}|
+---+---+---+---+------------------------+---------------------------+------------------------+---------------------------------------------------+---------------------------------------------------+

root
 |-- id: string (nullable = true)
 |-- c1: long (nullable = true)
 |-- c2: long (nullable = true)
 |-- c3: long (nullable = true)
 |-- c4: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
 |-- col2structarr: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- key: string (nullable = false)
 |    |    |-- value: long (nullable = true)
 |-- map2structarr: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- key: string (nullable = false)
 |    |    |-- value: long (nullable = true)
 |-- structarrs_merged: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- key: string (nullable = false)
 |    |    |-- value: long (nullable = true)
 |-- mergedstructarr2map: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
Answered By: samkart
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.