Spark SQL search inside an array for a struct
Question:
My data structure is defined approximately as follows:
schema = StructType([
    # ... fields skipped
    StructField("extra_features",
                ArrayType(StructType([
                    StructField("key", StringType(), False),
                    StructField("value", StringType(), True)
                ])), nullable=False)
])
Now, I’d like to search for entries in a data frame where a struct {"key": "somekey", "value": "somevalue"}
exists in the array column. How do I do this?
Answers:
Spark has a function array_contains
that can be used to check the contents of an ArrayType
column, but unfortunately it doesn’t seem to handle arrays of complex types. It is possible to do it with a UDF (user-defined function), however:
from pyspark.sql.types import *
from pyspark.sql import Row
import pyspark.sql.functions as F
schema = StructType([StructField("extra_features", ArrayType(StructType([
    StructField("key", StringType(), False),
    StructField("value", StringType(), True)])),
    False)])
df = spark.createDataFrame([
    Row([{'key': 'a', 'value': '1'}]),
    Row([{'key': 'b', 'value': '2'}])], schema)
# UDF to check whether {'key': 'a', 'value': '1'} is in an array.
# Each element of a (nested) StructType column arrives in Python as a Row.
contains_keyval = F.udf(lambda extra_features: Row(key='a', value='1') in extra_features, BooleanType())
df.where(contains_keyval(df.extra_features)).collect()
This results in:
[Row(extra_features=[Row(key=u'a', value=u'1')])]
You can also use the UDF to add another column that indicates whether the key-value pair is present:
df.withColumn('contains_it', contains_keyval(df.extra_features)).collect()
results in:
[Row(extra_features=[Row(key=u'a', value=u'1')], contains_it=True),
Row(extra_features=[Row(key=u'b', value=u'2')], contains_it=False)]
Since Spark 2.4.0 you can use the higher-order function exists.
Example with SparkSQL:
SELECT EXISTS(
    ARRAY(named_struct("key", "a", "value", "1"), named_struct("key", "b", "value", "2")),
    x -> x = named_struct("key", "a", "value", "1")
)
Example with PySpark:
df.filter('exists(extra_features, x -> x = named_struct("key", "a", "value", "1"))')
Note that not all the functions that manipulate arrays start with array_*.
Ex: exists, filter, size, …