PySpark: Create a condition from a string

Question:

I have to apply conditions to pyspark dataframes based on a distribution.

My distribution looks like:
mp = [413, 291, 205, 169, 135]

I am generating the condition expression like this:

when_decile = (F.when((F.col(colm) >= float(mp[0])), F.lit(1))
               .when( (F.col(colm) >= float(mp[1])) & (F.col(colm) < float(mp[0])), F.lit(2))
               .when( (F.col(colm) >= float(mp[2])) & (F.col(colm) < float(mp[1])), F.lit(3))
               .when( (F.col(colm) >= float(mp[3])) & (F.col(colm) < float(mp[2])), F.lit(4))
               .when( (F.col(colm) >= float(mp[4])) & (F.col(colm) < float(mp[3])), F.lit(5))
               .otherwise(F.lit(-99)))

Then applying it to dataframe:

df_temp = df_temp.withColumn('decile_rank', when_decile)

Now I have to put this code in a function that receives 'mp' and 'df_temp' as inputs. The length of 'mp' is variable.
So now I am generating the condition expression like this:

when_decile = '(F.when((F.col(colm) >= float(' + str(mp[0]) + ')), F.lit(1))'
for i in range(len(mp)-1):
    when_decile += '.when( (F.col(colm) >= float(' + str(mp[i+1]) + ')) & (F.col(colm) < float(' + str(mp[i]) + ')), F.lit(' + str(i+2) + '))'
when_decile += '.otherwise(F.lit(-99)))'

The problem now is that the 'when_decile' is a string and it cannot be applied to 'df_temp'.

How can I convert this string to a condition?

Asked By: karan


Answers:

Try evaluating the string with eval() (F and colm must be in scope where it runs):

df_temp = df_temp.withColumn('decile_rank', eval(when_decile))
Answered By: Tushar Patil

You could use F.expr() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.expr.html) and write your condition as a SQL expression rather than with PySpark column functions.

Something like this, with a minimal example:

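import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

colm = "barz"
sdf = spark.createDataFrame(pd.DataFrame({colm: [200, 300, 400, 500]}))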

def generate_when_decile_condition(mp):
    """Builds the when-decile condition as a SQL string"""
    when_decile = f"CASE WHEN `{colm}` >= {mp[0]} THEN 1 "
    for i in range(1, len(mp)):
        when_decile += f"WHEN `{colm}` >= {mp[i]} THEN {i + 1} "
    when_decile += 'ELSE -99 END'
    print("CASE_WHEN condition:\n", when_decile)
    return when_decile

sdf.withColumn(
    "decile_rank",
    F.expr(generate_when_decile_condition([413, 291])),  # <-- using F.expr here
).toPandas()


Note also that you can simplify your case-when conditions: the second < condition in each branch is unnecessary. Branches are evaluated in order, so reaching a branch already guarantees that the previous condition was not met, provided 'mp' is sorted in descending order.
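To illustrate why the upper bound is redundant, here is a minimal plain-Python sketch (no Spark involved) that mirrors both versions of the chain and checks that they agree. The function names `rank_with_upper_bound` and `rank_first_match` are hypothetical, made up for this illustration, and the sketch assumes 'mp' is sorted in descending order as in the question.

```python
def rank_with_upper_bound(value, mp):
    """Mirrors the original chain: every branch checks both bounds."""
    if value >= mp[0]:
        return 1
    for i in range(1, len(mp)):
        if mp[i] <= value < mp[i - 1]:
            return i + 1
    return -99  # same fallback as .otherwise(F.lit(-99))


def rank_first_match(value, mp):
    """Simplified chain: only the lower bound; the first match wins."""
    for i, threshold in enumerate(mp):
        if value >= threshold:
            return i + 1
    return -99


# Assumption: mp is sorted in descending order, as in the question.
mp = [413, 291, 205, 169, 135]
values = [500, 413, 412, 291, 290, 205, 170, 169, 135, 134, 0]

ranks = [rank_first_match(v, mp) for v in values]

# Both versions assign the same rank to every value.
assert ranks == [rank_with_upper_bound(v, mp) for v in values]
```

If 'mp' were not sorted in descending order, the two versions could disagree, so sorting the list (or validating it) before building the condition is probably worth doing.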

Answered By: ksgj1