Filtering a pyspark dataframe using isin by exclusion


I am trying to get all rows within a dataframe where a column's value is not within a list (so filtering by exclusion).

As an example:

df = sqlContext.createDataFrame([('1','a'),('2','b'),('3','b'),('4','c'),('5','d')], ['id','bar'])

I get the data frame:

+---+---+
| id|bar|
+---+---+
|  1|  a|
|  2|  b|
|  3|  b|
|  4|  c|
|  5|  d|
+---+---+

I want to exclude only the rows where bar is 'a' or 'b'.

Using an SQL expression string it would be:

df.filter('bar not in ("a","b")').show()

Is there a way of doing it without using the string for the SQL expression, or excluding one item at a time?


I am likely to have a list, ['a','b'], of the excluded values that I would like to use.

Asked By: gabrown86



df.filter((df.bar != 'a') & (df.bar != 'b'))
Answered By: Assaf Mendelson

It looks like the ~ operator gives the functionality that I need, but I have yet to find any appropriate documentation on it.


+---+---+
| id|bar|
+---+---+
|  4|  c|
|  5|  d|
+---+---+
Answered By: gabrown86

It could also be written like this:

from pyspark.sql.functions import col
df.filter(col('bar').isin(['a','b']) == False).show()
Answered By: Alezis

Here is a gotcha for those with their headspace in Pandas who are moving to pyspark:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
import pandas as pd

spark_conf = SparkConf().setMaster("local").setAppName("MyAppName")
sc = SparkContext(conf=spark_conf)
sqlContext = SQLContext(sc)

# Three records, one of which has no colour at all
records = [
    {"colour": "red"},
    {"colour": "blue"},
    {"colour": None},
]

pandas_df = pd.DataFrame.from_dict(records)
pyspark_df = sqlContext.createDataFrame(records)

So if we wanted the rows that are not red:
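
In Pandas that filter might look like the sketch below. It rebuilds the `records` list so it runs on its own, and `not_red_pd` is just an illustrative name:

```python
import pandas as pd

records = [{"colour": "red"}, {"colour": "blue"}, {"colour": None}]
pandas_df = pd.DataFrame(records)

# isin() yields False for the missing value, so ~ flips it to True
# and the row with no colour is kept alongside "blue".
not_red_pd = pandas_df[~pandas_df["colour"].isin(["red"])]
print(not_red_pd)
```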


As expected in Pandas

Looking good, and in our pyspark DataFrame


Not what I expected

So after some digging, I found this:
So to include nothingness in our results:

pyspark_df.filter(~pyspark_df["colour"].isin(["red"]) | pyspark_df["colour"].isNull()).show()
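
Why the extra `isNull()` check is needed can be modelled in plain Python. This is only a toy model of SQL-style three-valued logic, not Spark's implementation: `None` stands in for NULL, and the helper names (`sql_in`, `sql_not`, `sql_or`, `sql_where`) are hypothetical:

```python
from typing import Callable, List, Optional, Sequence


def sql_in(value: Optional[str], candidates: Sequence[Optional[str]]) -> Optional[bool]:
    """Model `value IN (candidates)`; None means NULL / unknown."""
    if value is None:
        return None
    if any(c is not None and c == value for c in candidates):
        return True
    if any(c is None for c in candidates):
        return None
    return False


def sql_not(x: Optional[bool]) -> Optional[bool]:
    """NOT unknown is still unknown."""
    return None if x is None else (not x)


def sql_or(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    """OR is true if either side is true, unknown if any side is unknown."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def sql_where(rows: List[Optional[str]],
              predicate: Callable[[Optional[str]], Optional[bool]]) -> List[Optional[str]]:
    """Keep only rows whose predicate is exactly True (unknown is dropped)."""
    return [r for r in rows if predicate(r) is True]


colours = ["red", "blue", None]

without_null_check = sql_where(colours, lambda c: sql_not(sql_in(c, ["red"])))
with_null_check = sql_where(
    colours, lambda c: sql_or(sql_not(sql_in(c, ["red"])), c is None)
)

print(without_null_check)  # ['blue']
print(with_null_check)     # ['blue', None]
```

In this model the null row is dropped by the first filter because its predicate is unknown rather than true, and the added "is null" branch is what turns it into a definite true.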

much ado about nothing

Answered By: Ryan Collingwood