Filter df when values match part of a string in PySpark

Question:

I have a large pyspark.sql.dataframe.DataFrame and I want to keep (i.e. filter for) all rows where the URL stored in the location column contains a pre-determined string, e.g. ‘google.com’.

I have tried:

import pyspark.sql.functions as sf
df.filter(sf.col('location').contains('google.com')).show(5)

But this throws:

TypeError: 'Column' object is not callable

How do I go about filtering my df properly?

Asked By: gaatjeniksaan


Answers:

Spark 2.2 onwards

df.filter(df.location.contains('google.com'))

Spark 2.2 documentation link
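
For reference, a minimal self-contained sketch (the sample rows and SparkSession setup below are illustrative assumptions; only the location column name comes from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data for illustration
df = spark.createDataFrame(
    [('https://google.com/search?q=spark',),
     ('https://example.org/home',)],
    ['location'],
)

# Keeps only the first row, whose URL contains 'google.com'
df.filter(df.location.contains('google.com')).show()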


Spark 2.1 and before

You can use plain SQL in filter:

df.filter("location like '%google.com%'")

or with DataFrame column methods:

df.filter(df.location.like('%google.com%'))

Spark 2.1 documentation link

Answered By: mrsrinivas

pyspark.sql.Column.contains() is only available in PySpark version 2.2 and above. Note that where is an alias for filter, so the following is equivalent:

df.where(df.location.contains('google.com'))
Answered By: joaofbsm

When filtering a DataFrame on string values, I find that the pyspark.sql.functions lower and upper come in handy if your data could have column entries like “foo” and “Foo”:

import pyspark.sql.functions as sql_fun
result = source_df.filter(sql_fun.lower(source_df.col_name).contains("foo"))
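
For example, under the same assumptions (a hypothetical source_df with mixed-case entries in col_name):

from pyspark.sql import SparkSession
import pyspark.sql.functions as sql_fun

spark = SparkSession.builder.getOrCreate()

# Hypothetical mixed-case sample data
source_df = spark.createDataFrame([('Foo',), ('foo',), ('bar',)], ['col_name'])

# lower() normalizes the case, so both 'Foo' and 'foo' match
result = source_df.filter(sql_fun.lower(source_df.col_name).contains('foo'))
result.show()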
Answered By: caffreyd

You can try the following expression, which helps you search for multiple strings at the same time:

df.filter(""" location rlike 'google.com|amazon.com|github.com' """)

Answered By: Rakesh Chintha