Multiple condition filter on dataframe

Question:

Can anyone explain to me why I am getting different results for these two expressions? I am trying to filter between two dates:

df.filter("act_date <='2017-04-01'" and "act_date >='2016-10-01'")
  .select("col1","col2").distinct().count()

Result: 37M

vs

df.filter("act_date <='2017-04-01'").filter("act_date >='2016-10-01'")
  .select("col1","col2").distinct().count()

Result: 25M

How are they different? It seems to me like they should produce the same result.

Asked By: femibyte


Answers:

TL;DR To pass multiple conditions to filter or where, use Column objects and logical operators (&, |, ~). See Pyspark: multiple conditions in when clause.

from pyspark.sql.functions import col
df.filter((col("act_date") >= "2016-10-01") & (col("act_date") <= "2017-04-01"))
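Note the parentheses around each comparison; they are needed because & binds more tightly than the comparison operators in Python.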

You can also use a single SQL string:

df.filter("act_date >='2016-10-01' AND act_date <='2017-04-01'")

In practice it makes more sense to use between:

df.filter(col("act_date").between("2016-10-01", "2017-04-01"))
df.filter("act_date BETWEEN '2016-10-01' AND '2017-04-01'")
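For reference, here is a minimal self-contained sketch (assuming a local SparkSession and a toy DataFrame with an act_date column of ISO date strings); all three forms keep the same rows:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Toy data: ISO-formatted date strings compare correctly as plain strings.
df = spark.createDataFrame(
    [("2016-09-30",), ("2016-12-15",), ("2017-04-01",), ("2017-06-01",)],
    ["act_date"],
)

# Each form keeps only 2016-12-15 and 2017-04-01 (between is inclusive).
df.filter((col("act_date") >= "2016-10-01") & (col("act_date") <= "2017-04-01")).show()
df.filter("act_date >= '2016-10-01' AND act_date <= '2017-04-01'").show()
df.filter(col("act_date").between("2016-10-01", "2017-04-01")).show()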

The first approach is not even remotely valid. In Python, the and operator returns:

  • The last element if all expressions are “truthy”.
  • The first “falsey” element otherwise.

As a result

"act_date <='2017-04-01'" and "act_date >='2016-10-01'"

is evaluated to (any non-empty string is truthy):

"act_date >='2016-10-01'"
Answered By: zero323

In the first case

df.filter("act_date <='2017-04-01'" and "act_date >='2016-10-01'")
  .select("col1","col2").distinct().count()

the result is all values with act_date greater than or equal to 2016-10-01, which also includes all the values after 2017-04-01.

Whereas in the second case

df.filter("act_date <='2017-04-01'").filter("act_date >='2016-10-01'")
  .select("col1","col2").distinct().count()

the result is only the values between 2016-10-01 and 2017-04-01.
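A minimal sketch of the difference, assuming df is the DataFrame from the question:

# The and expression collapses to its second string, so the first call
# applies only the lower bound, while chained filter() calls apply both.
only_lower_bound = df.filter("act_date >= '2016-10-01'")    # no upper bound
both_bounds = (df.filter("act_date <= '2017-04-01'")
                 .filter("act_date >= '2016-10-01'"))       # bounded on both sides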

Answered By: Ash Man