PySpark filter using startswith from list

Question:

I have a list of elements that may appear at the start of strings in the records of an RDD. If I have an element list of "yes" and "no", they should match "yes23" and "no3" but not "35yes" or "41no". Using PySpark, how can I use startswith to match elements in a list or tuple?

An example DF would be:

+-----+------+
|index| label|
+-----+------+
|    1|yes342|
|    2| 45yes|
|    3| no123|
|    4|  75no|
+-----+------+

When I try:

Element_List = ['yes', 'no']
filter_DF = DF.where(DF.label.startswith(tuple(Element_List)))

The resulting df should look something like:

+-----+------+
|index| label|
+-----+------+
|    1|yes342|
|    3| no123|
+-----+------+

Instead, I get the error:

Py4JError: An error occurred while calling o250.startsWith. Trace:
py4j.Py4JException: Method startsWith([class java.util.ArrayList]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)

It looks like startsWith can’t be used with any type of list. Is there a simple workaround?

Asked By: Jenks


Answers:

Compose an expression like this:

from pyspark.sql.functions import col, lit
from functools import reduce

element_list = ['yes','no']

df = spark.createDataFrame(
    ["yes23", "no3", "35yes", """41no["maybe"]"""],
    "string"
).toDF("location")

# OR together one startswith() predicate per prefix;
# lit(False) is the identity element for the fold
starts_with = reduce(
    lambda x, y: x | y,
    [col("location").startswith(s) for s in element_list],
    lit(False))

df.where(starts_with).show()
# +--------+
# |location|
# +--------+
# |   yes23|
# |     no3|
# +--------+
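As a small aside (my addition, not part of the original answer), the same fold can be written with operator.or_ instead of a lambda:

from functools import reduce
from operator import or_
from pyspark.sql.functions import col, lit

# or_ applies |, so this builds the same chained OR of startswith() predicates
starts_with = reduce(
    or_,
    [col("location").startswith(s) for s in element_list],
    lit(False))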
Answered By: Alper t. Turker

I feel the best way to achieve this is with a native PySpark function like rlike(). startswith() is meant for filtering against a single static prefix; it can't accept a list. If you want to take the keywords from a list dynamically, the best bet is to build a regular expression from the list, as below.

# List
li = ['yes', 'no']
# frame RegEx from the List
# in this case strings starting with yes/no i.e. "^(yes|no)"
reg_str = r"^("+ "|".join(li) + ")"

A custom UDF or RDD operations might also work, but Python UDFs can hinder performance compared to native column expressions.
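For completeness, a minimal sketch of that UDF alternative (my addition; the rlike() approach in the complete example below is still preferable):

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

li = ['yes', 'no']

@udf(returnType=BooleanType())
def starts_with_any(s):
    # plain Python str.startswith accepts a tuple of prefixes
    return s is not None and s.startswith(tuple(li))

# df.filter(starts_with_any(df.label)) works, but pays Python serialization overhead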

Below is the complete working example.

# Test DataFrame
df = spark.createDataFrame(
    ["yes23", "no3", "35yes"],
    "string"
).toDF("label")

# List
li = ['yes', 'no']
# frame RegEx from the List
# in this case strings starting with yes/no i.e. "^(yes|no)"
reg_str = r"^("+ "|".join(li) + ")"

# Filter the DataFrame with the RegEx
df.filter(df.label.rlike(reg_str)).show()

# +-----+
# |label|
# +-----+
# |yes23|
# |  no3|
# +-----+
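One caveat worth adding (my note, not from the original answer): if the prefixes can contain regex metacharacters, escape them before joining, e.g. with re.escape:

import re

li = ['yes', 'no', 'n/a(']  # 'n/a(' is a hypothetical prefix containing a metacharacter
reg_str = r"^(" + "|".join(re.escape(s) for s in li) + ")"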
Answered By: Kris

When the elements in the list are of the same length, we can do this:

element_list = ['ye', 'no']
df = df.filter(F.col('label')[0:2].isin(element_list))

Full example:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 'yes342'),
     (2, '45yes'),
     (3, 'no123'),
     (4, '75no')],
    ['index', 'label'])
df.show()
# +-----+------+
# |index| label|
# +-----+------+
# |    1|yes342|
# |    2| 45yes|
# |    3| no123|
# |    4|  75no|
# +-----+------+

element_list = ['ye', 'no']
df = df.filter(F.col('label')[0:2].isin(element_list))

df.show()
# +-----+------+
# |index| label|
# +-----+------+
# |    1|yes342|
# |    3| no123|
# +-----+------+
Answered By: ZygD
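A note on the slice syntax (my addition): indexing a Column with a slice like [0:2] is shorthand for Column.substr, so an equivalent, more explicit form of the filter is:

from pyspark.sql import functions as F

# substring() is 1-based: take the first two characters and test list membership
df = df.filter(F.substring('label', 1, 2).isin(element_list))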