How to filter values in PySpark using multiple OR conditions?


I am trying to translate a SQL query into PySpark. The SQL query looks like this. I need to set ZIPCODE='0' where the conditions below are satisfied.


The PySpark query I am trying to implement is:

df=df.withColumn('ZIPCODE', F.when( (col('COUNTRY_TABLE.STATE') == 'TN') | (col('COUNTRY_TABLE.STATE') == 'DEL') 
| (col('COUNTRY_TABLE.STATE') == 'UK') | (col('COUNTRY_TABLE.STATE') == 'UP') | (col('COUNTRY_TABLE.STATE') == 'HP')  
| (col('COUNTRY_TABLE.STATE') == 'JK') | (col('COUNTRY_TABLE.STATE') == 'MP') & (col('length_ZIP') < '5'), '0')

In my PySpark code I use a column called 'length_ZIP'. I compute the length of the ZIPCODE column, store it in the separate column 'length_ZIP', and compare against that column in the condition.

df = df.withColumn('ZIPCODE', substring('ZIPCODE', 1, 5))  # keep only the first five characters

I am not getting the expected result. Can anyone help me with what I should change to get it?

Asked By: BigData Lover



First, you can clean up your code this way:

Instead of writing one condition per state by hand, you can create a list of states. Then, if you want to add or remove a state, you only need to update the list, and the code becomes more readable.

states = ["TN", "DEL", "UK", "UP", "HP", "JK", "MP"]

Then create a new column called "length_ZIP" by applying PySpark's length function to the ZIPCODE column, which gives the length of each zip code.

import pyspark.sql.functions as F

df = df.withColumn("length_ZIP",F.length("ZIPCODE"))

Finally, overwrite the ZIPCODE column with the condition: if the state is in the states list AND the zip code length is less than 5, ZIPCODE is set to '0'. Otherwise it keeps its original ZIPCODE value.

df = df.withColumn("ZIPCODE", F.when(F.col("state").isin(states) & (F.col("length_ZIP") < 5), "0").otherwise(df["ZIPCODE"]))
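
To check the Spark output on a few sample rows, it can help to write the same rule as a plain Python function. This is only a rough reference for expected values, not a replacement for the column expression. The helper name `expected_zip` and the sample rows are made up for illustration, and the null handling is an assumption about how you want missing zip codes treated.

```python
# Plain-Python reference for the row-level rule; no Spark session is needed.
STATES = {"TN", "DEL", "UK", "UP", "HP", "JK", "MP"}


def expected_zip(state, zipcode):
    """Return '0' when the state is listed and the zip code has fewer than 5 characters."""
    if zipcode is None:
        # Assumption: a missing zip code is left as-is (the when() test is not met)
        return None
    if state in STATES and len(zipcode) < 5:
        return "0"
    return zipcode


# Hypothetical sample rows: (state, zipcode)
rows = [("TN", "1234"), ("TN", "12345"), ("NY", "123"), ("MP", "99")]
expected = [expected_zip(s, z) for s, z in rows]
print(expected)  # ['0', '12345', '123', '0']
```

Comparing this list with the ZIPCODE values collected from the DataFrame for the same rows is a quick way to confirm the condition is grouped the way you intended.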

I think the main problem in your code is operator precedence. In Python, "&" binds more tightly than "|", so when both appear at the same level without parentheses, the "&" part is evaluated first.
If you do something like:

True | False & False #True

It returns True because False & False is evaluated first (giving False), and then True | False gives True.
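
Applied to the condition in the question, this means the length check only attaches to the last state comparison. A small sketch with plain booleans shows the difference; the variable names are stand-ins for the column comparisons, not real Spark columns.

```python
# Stand-ins for the comparisons in the question (plain booleans, not Spark columns)
first_state_matches = True   # e.g. STATE == 'TN'
last_state_matches = False   # e.g. STATE == 'MP'
zip_is_short = False         # e.g. length_ZIP < 5

# Without parentheses, & is applied before |
unparenthesized = first_state_matches | last_state_matches & zip_is_short
# This is how Python actually groups the line above
implicit_grouping = first_state_matches | (last_state_matches & zip_is_short)
# This is the grouping the question needs: any state AND a short zip code
intended_grouping = (first_state_matches | last_state_matches) & zip_is_short

print(unparenthesized)    # True
print(implicit_grouping)  # True
print(intended_grouping)  # False
```

So a 'TN' row with a five-character zip code would still match the unparenthesized condition, even though the zip code is not short.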

Answered By: Cezar Peixeiro

In your case, you are combining the AND condition with the OR conditions without separating them, which is why you are not getting the desired output.

To resolve this, wrap all the OR conditions in parentheses and then apply the AND condition. The OR conditions are then evaluated as one group, and the AND condition is applied to that group's result.

from pyspark.sql.functions import col,length,when

df2 = df1.withColumn('Zipcode', when(((col('State') == 'TN') | (col('State') == 'DEL')
| (col('State') == 'UK') | (col('State') == 'UP') | (col('State') == 'HP')
| (col('State') == 'JK') | (col('State') == 'MP')) & (col('length_ZIP') < 5), '0')
.otherwise(col('Zipcode')))  # assumed: keep the original value when the condition is not met

Answered By: PratikLad