Fill nulls with values from another column in PySpark

Question:

I have a dataset

col_id     col_2     col_3    col_id_b
ABC111     shfhs     34775    null
ABC112     shfhe     34775    DEF345
ABC112     shfhs     34775    GFR563
ABC112     shfgh     34756    TRS572
ABC113     shfdh     34795    null
ABC114     shfhs     34770    null

I am trying to create a new column that is identical to col_id_b, except that the nulls take the value of the corresponding col_id from that row. So:

col_id     col_2     col_3    col_id_b    col_new
ABC111     shfhs     34775    null        ABC111
ABC112     shfhe     34775    DEF345      DEF345
ABC112     shfhs     34775    GFR563      GFR563
ABC112     shfgh     34756    TRS572      TRS572
ABC113     shfdh     34795    null        ABC113
ABC114     shfhs     34770    null        ABC114

I know about:

df.select(coalesce(df["col_id"], df["col_id_b"])).show()

But in my case there are my rows where both are not-null. How do I introduce this condition?

Asked By: johnnydoe

||

Answers:

Just invert the order of the columns:

df.select(coalesce(col('col_id_b'), col('col_id')))

coalesce returns the first column that is not null; so if you specify col_id_b first, it this is not null, you will have col_id_b, otherwise col_id.

Answered By: Ric S