Pyspark stack to fill existing column
Question:
I have a data frame like below
col1, col2 … coln, key, value
I got the key, value after exploding a col with format map<string, string>
Now I want to stack col2 to coln and insert them in the key, value columns. I want the previous values in key, value cols to remain as it is.
How to achieve this? One solution I can think of is pivot the key, value column and then create the stack expression. Anything better and faster than this?
Output
Answers:
You can add the new elements within the map column before exploding it. It will be easier to process.
Here’s an example using a sample data similar to your sample’s structure.
data_sdf.show()
# +---+---+---+---+------------------------+
# |id |c1 |c2 |c3 |c4 |
# +---+---+---+---+------------------------+
# |abc|1 |2 |3 |{key_b -> 5, key_a -> 4}|
# +---+---+---+---+------------------------+
col_list = ['c1', 'c2', 'c3']
data_sdf.
withColumn('col2structarr',
func.array(*[func.struct(func.lit(k).alias('key'), func.col(k).alias('value'))
for k in col_list]
)
).
withColumn('map2structarr', func.map_entries('c4')).
withColumn('structarrs_merged', func.array_union('map2structarr', 'col2structarr')).
withColumn('mergedstructarr2map', func.map_from_entries('structarrs_merged')).
selectExpr('id', 'explode(mergedstructarr2map)').
show(truncate=False)
# +---+-----+-----+
# |id |key |value|
# +---+-----+-----+
# |abc|key_b|5 |
# |abc|key_a|4 |
# |abc|c1 |1 |
# |abc|c2 |2 |
# |abc|c3 |3 |
# +---+-----+-----+
- a map column is like a dictionary and has key-value pairs. So, when it is converted to array of structs, it generates 2 fields within the structs –
key
, value
.
map2structarr
– convert the map column to array of structs, so that we can use array_union
for the new element addition.
col2structarr
– using the aforementioned info on map columns and array of structs, we can create an array of structs for the columns that we need to add in the map column.
structarrs_merged
– merge the 2 arrays created above (1 from map column, and 1 from the columns we want to add) using array_join
. The function requires the arrays to be of same data type.
mergedstructarr2map
– an array of strut can be converted to map type column using map_from_entries
.
- use
explode
to create key
and value
columns from the final map type column.
Here is a snapshot of all the individual columns created above (before the explode
)
+---+---+---+---+------------------------+---------------------------+------------------------+---------------------------------------------------+---------------------------------------------------+
|id |c1 |c2 |c3 |c4 |col2structarr |map2structarr |structarrs_merged |mergedstructarr2map |
+---+---+---+---+------------------------+---------------------------+------------------------+---------------------------------------------------+---------------------------------------------------+
|abc|1 |2 |3 |{key_b -> 5, key_a -> 4}|[{c1, 1}, {c2, 2}, {c3, 3}]|[{key_b, 5}, {key_a, 4}]|[{key_b, 5}, {key_a, 4}, {c1, 1}, {c2, 2}, {c3, 3}]|{key_b -> 5, key_a -> 4, c1 -> 1, c2 -> 2, c3 -> 3}|
+---+---+---+---+------------------------+---------------------------+------------------------+---------------------------------------------------+---------------------------------------------------+
root
|-- id: string (nullable = true)
|-- c1: long (nullable = true)
|-- c2: long (nullable = true)
|-- c3: long (nullable = true)
|-- c4: map (nullable = true)
| |-- key: string
| |-- value: long (valueContainsNull = true)
|-- col2structarr: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- key: string (nullable = false)
| | |-- value: long (nullable = true)
|-- map2structarr: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- key: string (nullable = false)
| | |-- value: long (nullable = true)
|-- structarrs_merged: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- key: string (nullable = false)
| | |-- value: long (nullable = true)
|-- mergedstructarr2map: map (nullable = true)
| |-- key: string
| |-- value: long (valueContainsNull = true)
I have a data frame like below
col1, col2 … coln, key, value
I got the key, value after exploding a col with format map<string, string>
Now I want to stack col2 to coln and insert them in the key, value columns. I want the previous values in key, value cols to remain as it is.
How to achieve this? One solution I can think of is pivot the key, value column and then create the stack expression. Anything better and faster than this?
Output
You can add the new elements within the map column before exploding it. It will be easier to process.
Here’s an example using a sample data similar to your sample’s structure.
data_sdf.show()
# +---+---+---+---+------------------------+
# |id |c1 |c2 |c3 |c4 |
# +---+---+---+---+------------------------+
# |abc|1 |2 |3 |{key_b -> 5, key_a -> 4}|
# +---+---+---+---+------------------------+
col_list = ['c1', 'c2', 'c3']
data_sdf.
withColumn('col2structarr',
func.array(*[func.struct(func.lit(k).alias('key'), func.col(k).alias('value'))
for k in col_list]
)
).
withColumn('map2structarr', func.map_entries('c4')).
withColumn('structarrs_merged', func.array_union('map2structarr', 'col2structarr')).
withColumn('mergedstructarr2map', func.map_from_entries('structarrs_merged')).
selectExpr('id', 'explode(mergedstructarr2map)').
show(truncate=False)
# +---+-----+-----+
# |id |key |value|
# +---+-----+-----+
# |abc|key_b|5 |
# |abc|key_a|4 |
# |abc|c1 |1 |
# |abc|c2 |2 |
# |abc|c3 |3 |
# +---+-----+-----+
- a map column is like a dictionary and has key-value pairs. So, when it is converted to array of structs, it generates 2 fields within the structs –
key
,value
. map2structarr
– convert the map column to array of structs, so that we can usearray_union
for the new element addition.col2structarr
– using the aforementioned info on map columns and array of structs, we can create an array of structs for the columns that we need to add in the map column.structarrs_merged
– merge the 2 arrays created above (1 from map column, and 1 from the columns we want to add) usingarray_join
. The function requires the arrays to be of same data type.mergedstructarr2map
– an array of strut can be converted to map type column usingmap_from_entries
.- use
explode
to createkey
andvalue
columns from the final map type column.
Here is a snapshot of all the individual columns created above (before the explode
)
+---+---+---+---+------------------------+---------------------------+------------------------+---------------------------------------------------+---------------------------------------------------+
|id |c1 |c2 |c3 |c4 |col2structarr |map2structarr |structarrs_merged |mergedstructarr2map |
+---+---+---+---+------------------------+---------------------------+------------------------+---------------------------------------------------+---------------------------------------------------+
|abc|1 |2 |3 |{key_b -> 5, key_a -> 4}|[{c1, 1}, {c2, 2}, {c3, 3}]|[{key_b, 5}, {key_a, 4}]|[{key_b, 5}, {key_a, 4}, {c1, 1}, {c2, 2}, {c3, 3}]|{key_b -> 5, key_a -> 4, c1 -> 1, c2 -> 2, c3 -> 3}|
+---+---+---+---+------------------------+---------------------------+------------------------+---------------------------------------------------+---------------------------------------------------+
root
|-- id: string (nullable = true)
|-- c1: long (nullable = true)
|-- c2: long (nullable = true)
|-- c3: long (nullable = true)
|-- c4: map (nullable = true)
| |-- key: string
| |-- value: long (valueContainsNull = true)
|-- col2structarr: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- key: string (nullable = false)
| | |-- value: long (nullable = true)
|-- map2structarr: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- key: string (nullable = false)
| | |-- value: long (nullable = true)
|-- structarrs_merged: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- key: string (nullable = false)
| | |-- value: long (nullable = true)
|-- mergedstructarr2map: map (nullable = true)
| |-- key: string
| |-- value: long (valueContainsNull = true)