Spark SQL search inside an array for a struct

Question:

My data structure is defined approximately as follows:

schema = StructType([
    # ... fields skipped
    StructField("extra_features",
        ArrayType(StructType([
            StructField("key", StringType(), False),
            StructField("value", StringType(), True)
        ])), nullable=False)
])

Now, I’d like to search for entries in a data frame where a struct {"key": "somekey", "value": "somevalue"} exists in the array column. How do I do this?

Asked By: Konrads

||

Answers:

Spark has a function, array_contains, that checks the contents of an ArrayType column, but it does not seem to handle arrays of complex types. It is possible to do this with a UDF (user-defined function), however:

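from pyspark.sql.types import ArrayType, BooleanType, StringType, StructField, StructType
from pyspark.sql import Row
import pyspark.sql.functions as F

# `spark` is the active SparkSession (predefined in the pyspark shell)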

schema = StructType([StructField("extra_features", ArrayType(StructType([
    StructField("key", StringType(), False),
    StructField("value", StringType(), True)])),
    False)])

df = spark.createDataFrame([
    Row([{'key': 'a', 'value': '1'}]),
    Row([{'key': 'b', 'value': '2'}])], schema)

# UDF to check whether {'key': 'a', 'value': '1'} is in an array
# The actual data of a (nested) StructType value is a Row
contains_keyval = F.udf(lambda extra_features: Row(key='a', value='1') in extra_features, BooleanType())

df.where(contains_keyval(df.extra_features)).collect()

This results in:

[Row(extra_features=[Row(key=u'a', value=u'1')])]
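
The membership test inside the UDF works because a Row is a tuple subclass, so `in` compares values position by position. The sketch below illustrates that comparison in plain Python; `Feature` and `contains_pair` are hypothetical names, and `collections.namedtuple` is only a stand-in for `pyspark.sql.Row`, so this shows the comparison logic rather than anything Spark-specific:

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row, which is also a tuple subclass.
Feature = namedtuple("Feature", ["key", "value"])


def contains_pair(features, key, value):
    """Return True if the (key, value) pair occurs in the list of features."""
    # None models a null array; the schema above declares the column
    # non-nullable, so this guard is purely defensive.
    if features is None:
        return False
    return Feature(key, value) in features


row_a = [Feature("a", "1")]
row_b = [Feature("b", "2")]

found_in_a = contains_pair(row_a, "a", "1")
found_in_b = contains_pair(row_b, "a", "1")

# Tuple equality is positional: swapping the fields no longer matches.
swapped = Feature("1", "a") in row_a
```

One consequence is that the values passed to Row in the UDF need to be in the same order as the fields of the struct in the schema.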

You can also use the UDF to add another column that indicates whether the key-value pair is present:

df.withColumn('contains_it', contains_keyval(df.extra_features)).collect()

results in:

[Row(extra_features=[Row(key=u'a', value=u'1')], contains_it=True),
 Row(extra_features=[Row(key=u'b', value=u'2')], contains_it=False)]

Answered By: sgvd

Since Spark 2.4.0, you can use the exists function.

Example with SparkSQL:

SELECT
    EXISTS
    (
        ARRAY(named_struct("key", "a", "value", "1"), named_struct("key", "b", "value", "2")),
        x -> x = named_struct("key", "a", "value", "1")
    )

Example with PySpark:

df.filter('exists(extra_features, x -> x = named_struct("key", "a", "value", "1"))')

Note that not all of the functions for manipulating arrays start with array_*.
Ex: exists, filter, size, …

Answered By: programort