pyspark get element from array Column of struct based on condition

Question:

I have a spark df with the following schema:

 |-- col1 : string
 |-- col2 : string
 |-- customer: struct
 |    |-- smt: string
 |    |-- attributes: array (nullable = true)
 |    |    |-- element: struct
 |    |    |     |-- key: string
 |    |    |     |-- value: string

df:

#+-------+-------+-----------------------------------------------------------------------------+
#|col1   |col2   |customer                                                                     |
#+-------+-------+-----------------------------------------------------------------------------+
#|col1_XX|col2_XX|{"attributes": [{"key": "A", "value": "123"}, {"key": "B", "value": "456"}]} |
#+-------+-------+-----------------------------------------------------------------------------+

and the JSON input for the array looks like this:

...
          "attributes": [
            {
              "key": "A",
              "value": "123"
            },
            {
              "key": "B",
              "value": "456"
            }
          ],

I would like to loop over the attributes array, find the element with key="B", and then select the corresponding value. I don't want to use explode because I would like to avoid joining dataframes.
Is it possible to perform this kind of operation directly using a Spark Column?

Expected output will be:

#+-------+-------+-----+
#|col1   |col2   |B    |
#+-------+-------+-----+
#|col1_XX|col2_XX|456  |
#+-------+-------+-----+

Any help would be appreciated.
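
For reference, a minimal sketch of how a DataFrame with this schema could be built (the smt value and the variable names are assumed here, since they are not shown above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# struct values are passed positionally: (smt, attributes)
data = [
    ("col1_XX", "col2_XX", ("smt_XX", [("A", "123"), ("B", "456")])),
]
schema = (
    "col1 string, col2 string, "
    "customer struct<smt:string, attributes:array<struct<key:string,value:string>>>"
)
df = spark.createDataFrame(data, schema)
df.printSchema()
df.show(truncate=False)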

Asked By: Salvatore Nedia


Answers:

You can use the filter higher-order function to filter the array of structs, then take the value of the matching element:

from pyspark.sql import functions as F

df2 = df.withColumn(
    "B",
    # keep only the structs whose key is 'B', then take the value of the first match
    F.expr("filter(customer.attributes, x -> x.key = 'B')")[0]["value"]
)
Answered By: blackbishop
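
On Spark 3.1+ the same logic can also be written with the Python higher-order function API instead of a SQL expression string; a sketch, assuming the same df as above (element_at is 1-based, and with the default non-ANSI setting it returns null when no attribute matches):

from pyspark.sql import functions as F

df2 = df.withColumn(
    "B",
    F.element_at(
        F.filter("customer.attributes", lambda x: x["key"] == "B"), 1
    )["value"],
)
df2.select("col1", "col2", "B").show(truncate=False)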