How to check whether an array contains a string in PySpark with this structure

Question:

The curly brackets look odd. I tried several different approaches, but none of them worked.

# root
#  |-- L: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- S: string (nullable = true)

# +------------------+
# |                 L|
# +------------------+
# |       [{string1}]|
# |       [{string2}]|
# +------------------+
Asked By: TommyQu


Answers:

Use filter() to keep only the array elements that match a given criterion.

Because the array elements are structs, use getField() to read the string field, then use contains() to check whether that string contains the search term.

The following example searches for the term "hello":

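from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# In the pyspark shell or a notebook, `spark` already exists and getOrCreate() returns it.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(data=[[[("hello world",)]], [[("foo bar",)]]], schema="L array<struct<S string>>")

string_to_search = "hello"

# Keep only the structs whose S field contains the search term (F.filter needs Spark 3.1+),
# then flag the row if at least one element survives.
df = df.withColumn("arr_contains_str",
                   F.size(
                          F.filter("L",
                                   lambda e: e.getField("S")
                                              .contains(string_to_search))) > 0)

df.show(truncate=False)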

Output:

+---------------+----------------+
|L              |arr_contains_str|
+---------------+----------------+
|[{hello world}]|true            |
|[{foo bar}]    |false           |
+---------------+----------------+
Answered By: Azhar Khan