PySpark: Create a condition from a string
Question:
I have to apply conditions to pyspark dataframes based on a distribution.
My distribution looks like:
mp = [413, 291, 205, 169, 135]
And I am generating condition expression like this:
when_decile = (F.when((F.col(colm) >= float(mp[0])), F.lit(1))
.when( (F.col(colm) >= float(mp[1])) & (F.col(colm) < float(mp[0])), F.lit(2))
.when( (F.col(colm) >= float(mp[2])) & (F.col(colm) < float(mp[1])), F.lit(3))
.when( (F.col(colm) >= float(mp[3])) & (F.col(colm) < float(mp[2])), F.lit(4))
.when( (F.col(colm) >= float(mp[4])) & (F.col(colm) < float(mp[3])), F.lit(5))
.otherwise(F.lit(-99)))
Then applying it to dataframe:
df_temp = df_temp.withColumn('decile_rank', when_decile)
Now I have to keep this code in a function which receives 'mp' and 'df_temp' as inputs. The length of 'mp' is variable.
So, now I am generating condition expression like this:
when_decile = '(F.when((F.col(colm) >= float(' + str(mp[0]) + ')), F.lit(1))'
for i in range(len(mp)-1):
    when_decile += '.when( (F.col(colm) >= float(' + str(mp[i+1]) + ')) & (F.col(colm) < float(' + str(mp[i]) + ')), F.lit(' + str(i+2) + '))'
when_decile += '.otherwise(F.lit(-99)))'
The problem now is that 'when_decile' is a string and it cannot be applied to 'df_temp'.
How can I convert this string to a condition?
Answers:
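Try this (it needs F and colm to be in scope where eval is called, and the string should only be built from trusted input):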
df_temp = df_temp.withColumn('decile_rank', eval(when_decile))
You could use F.expr() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.expr.html) and rewrite your condition in SQL rather than in PySpark.
Something like this, with a minimal example:
import pandas as pd
from pyspark.sql import functions as F

# Assumes an active SparkSession is available as `spark`.
colm = "barz"
sdf = spark.createDataFrame(pd.DataFrame({colm: [200, 300, 400, 500]}))

def generate_when_decile_condition(mp):
    """Builds the when-decile condition as a SQL string"""
    when_decile = f"CASE WHEN `{colm}` >= {mp[0]} THEN 1 "
    for i in range(1, len(mp)):
        when_decile += f"WHEN `{colm}` >= {mp[i]} THEN {i + 1} "
    when_decile += 'ELSE -99 END'
    print("CASE_WHEN condition:\n", when_decile)
    return when_decile

sdf.withColumn(
    "decile_rank",
    F.expr(generate_when_decile_condition([413, 291])),  # <-- using F.expr here
).toPandas()
Note also that you should be able to simplify your case-when conditions: there is no need for the second < condition in each case-when line, since it is guaranteed by the fact that the previous case-when condition wasn't fulfilled. This holds as long as 'mp' is sorted in descending order, as in your example.
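As a rough sketch of that simplification, the SQL string can be built with a single lower bound per branch. The function name build_decile_case, its default parameter, and the validation checks are illustrative assumptions, not part of the original answer. The block is plain Python and only produces the string, so it runs without a Spark session.

```python
from numbers import Real


def build_decile_case(colm, mp, default=-99):
    """Return a SQL CASE expression ranking `colm` against descending breakpoints.

    Hypothetical helper for illustration; `mp` is the list of breakpoints.
    """
    if not mp:
        raise ValueError("mp must contain at least one breakpoint")
    # Only plain numbers are interpolated into the SQL text.
    if not all(isinstance(b, Real) and not isinstance(b, bool) for b in mp):
        raise TypeError("breakpoints must be numeric")
    # Dropping the upper bound is only valid when the breakpoints are
    # strictly descending, because CASE stops at the first matching branch.
    if any(a <= b for a, b in zip(mp, mp[1:])):
        raise ValueError("mp must be strictly descending")
    branches = [
        f"WHEN `{colm}` >= {float(b)} THEN {rank}"
        for rank, b in enumerate(mp, start=1)
    ]
    return "CASE " + " ".join(branches) + f" ELSE {default} END"


case_sql = build_decile_case("barz", [413, 291, 205, 169, 135])
print(case_sql)
```

The resulting string can then be passed to F.expr, for example df_temp.withColumn('decile_rank', F.expr(case_sql)), in the same way as in the answer above.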