How to concatenate a specific value to a column when conditions match?

Question:

I have SQL code that I am trying to convert to PySpark.
The SQL query is shown below. I need to prepend '0' to ADDRESS_HOME when the query's conditions are satisfied.

   UPDATE STUDENT_DATA 
   SET STUDENT_DATA.ADDRESS_HOME = "0" & [STUDENT_DATA].ADDRESS_HOME
   WHERE (((STUDENT_DATA.STATE_ABB)="TURIN" Or
   (STUDENT_DATA.STATE_ABB)="RUSH" Or 
   (STUDENT_DATA.STATE_ABB)="MEXIC" Or 
   (STUDENT_DATA.STATE_ABB)="VINTA") 
   AND ((Len([ADDRESS_HOME])) < "5"));

Thank you in advance for your responses.

# +---+------------+---------+
# | ID|ADDRESS_HOME|STATE_ABB|
# +---+------------+---------+
# |  1|        7645|     RUSH|
# |  2|       98364|    MEXIC|
# |  3|        2980|    TURIN|
# |  4|        6728|    VINTA|
# |  5|         128|    VINTA|
# +---+------------+---------+


EXPECTED OUTPUT
# +---+------------+---------+
# | ID|ADDRESS_HOME|STATE_ABB|
# +---+------------+---------+
# |  1|       07645|     RUSH|
# |  2|       98364|    MEXIC|
# |  3|       02980|    TURIN|
# |  4|       06728|    VINTA|
# |  5|        0128|    VINTA|
# +---+------------+---------+
Asked By: BigData Lover


Answers:

First, filter your DF to find the rows you want to update.

Then update the column on those rows (the first withColumn).

After updating, join the updated DF with your original DataFrame so that all rows end up in one DataFrame again, and use coalesce to build the final address.

Finally, select the values from the original DF (Id and State) plus the coalesced address. Because of the coalesce, the result is never null: rows that matched the filter get the updated value, and rows that did not match keep their original value.

This answer should solve your problem, but @Emma's answer is more efficient.

from pyspark.sql import functions as f

states = ["TURIN", "RUSH", "MEXIC", "VINTA"]

# Rows that match the UPDATE conditions, with "0" prepended to the address
df_filtered = df.filter(
    f.col("STATE_ABB").isin(states) & (f.length("ADDRESS_HOME") < 5)
).select(
    "ID",
    f.concat(f.lit("0"), f.col("ADDRESS_HOME")).alias("ADDRESS_HOME_CONCAT"),
)

# Left join from the original DF so that rows outside the filter are kept
df = df.alias("original_df").join(
    df_filtered.alias("df_filtered"),
    on=f.col("original_df.ID") == f.col("df_filtered.ID"),
    how="left",
).select(
    f.col("original_df.ID").alias("ID"),
    # Updated value where the filter matched, original value otherwise
    f.coalesce(
        f.col("df_filtered.ADDRESS_HOME_CONCAT"),
        f.col("original_df.ADDRESS_HOME"),
    ).alias("ADDRESS_HOME"),
    f.col("original_df.STATE_ABB").alias("STATE_ABB"),
)

Sorry for any typos, I posted this from my cellphone!

Answered By: OdiumPura

If you want to align the ADDRESS_HOME to be 5 digits and pad with 0, you can use lpad.

from pyspark.sql import functions as F

df = df.withColumn('ADDRESS_HOME', F.lpad('ADDRESS_HOME', 5, '0'))
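Note that padding to a fixed width is not the same as prepending a single character: a 3-character value such as 128 becomes 00128 when padded to width 5, while the expected output in the question shows 0128. A minimal plain-Python sketch of that difference follows. The helper names are hypothetical, and str.rjust only stands in for lpad on values that are not longer than the width; this is not Spark code.

```python
def pad_to_width(address, width=5):
    # Fill with zeros on the left until the value is `width` characters long.
    return str(address).rjust(width, "0")

def prepend_one_zero(address, width=5):
    # Add exactly one "0" when the value is shorter than `width`.
    text = str(address)
    return "0" + text if len(text) < width else text

for value in (7645, 128):
    print(value, pad_to_width(value), prepend_one_zero(value))
```

For 4-character values the two approaches agree; they only differ when the value is more than one character short of the target width.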

If you want to pad with only one character (0) when ADDRESS_HOME has fewer than 5 characters:

df = df.withColumn(
    'ADDRESS_HOME',
    # otherwise() must be chained on the when() column, not on the DataFrame
    F.when(F.length('ADDRESS_HOME') < 5, F.concat(F.lit('0'), F.col('ADDRESS_HOME')))
     .otherwise(F.col('ADDRESS_HOME'))
)

UPDATE:

You can collapse all the OR conditions into a single IN clause (isin), then combine it with the length condition using a logical AND (&).

states = ['RUSH', 'MEXIC', 'TURIN', 'VINTA']

df = (df.withColumn('ADDRESS_HOME', 
                    F.when(F.col('STATE_ABB').isin(states) & (F.length('ADDRESS_HOME') < 5), 
                           F.concat(F.lit('0'), F.col('ADDRESS_HOME')))
                     .otherwise(F.col('ADDRESS_HOME'))))
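
As a quick sanity check of the rule itself, independent of Spark, the same condition can be sketched in plain Python and run against the sample rows from the question. The helper name pad_address and the row tuples are illustrative assumptions; the code only mirrors the when/otherwise logic above and is not a replacement for it.

```python
# Plain-Python model of the when/otherwise rule (not Spark code).
STATES = {"RUSH", "MEXIC", "TURIN", "VINTA"}

def pad_address(address, state):
    """Prepend one '0' when the state matches and the address has fewer than 5 characters."""
    text = str(address)
    if state in STATES and len(text) < 5:
        return "0" + text
    return text

# Sample rows from the question: (ID, ADDRESS_HOME, STATE_ABB)
rows = [
    (1, 7645, "RUSH"),
    (2, 98364, "MEXIC"),
    (3, 2980, "TURIN"),
    (4, 6728, "VINTA"),
    (5, 128, "VINTA"),
]

result = [(row_id, pad_address(addr, state), state) for row_id, addr, state in rows]
for row in result:
    print(row)
```

The addresses this produces match the expected output in the question: row 2 is left alone because it already has 5 characters, and row 5 gains a single leading zero.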
Answered By: Emma