Replace null with empty string when writing Spark dataframe

Question:

Is there a way to replace null values in a column with an empty string when writing a Spark DataFrame to a file?

Sample data:

+----------------+------------------+
|   UNIQUE_MEM_ID|              DATE|
+----------------+------------------+
|            1156|              null|
|            3787|        2016-07-05|
|            1156|              null|
|            5064|              null|
|            5832|              null|
|            3787|              null|
|            5506|              null|
|            7538|              null|
|            7436|              null|
|            5091|              null|
|            8673|              null|
|            2631|              null|
|            8561|              null|
|            3516|              null|
|            1156|              null|
|            5832|              null|
|            2631|        2016-07-07|
Asked By: ben

Answers:

Check this out. You can use when() and otherwise():

    from pyspark.sql import functions as F

    df.show()

    #InputDF
    # +-------------+----------+
    # |UNIQUE_MEM_ID|      DATE|
    # +-------------+----------+
    # |         1156|      null|
    # |         3787|2016-07-05|
    # |         1156|      null|
    # +-------------+----------+


    df.withColumn("DATE", F.when(F.col("DATE").isNull(), '').otherwise(F.col("DATE"))).show()

    #OUTPUTDF
    # +-------------+----------+
    # |UNIQUE_MEM_ID|      DATE|
    # +-------------+----------+
    # |         1156|          |
    # |         3787|2016-07-05|
    # |         1156|          |
    # +-------------+----------+

To apply the above logic to all columns of the dataframe, iterate over df.columns and fill an empty string wherever a column's value is null:

    df.select(*[F.when(F.col(column).isNull(), '').otherwise(F.col(column)).alias(column) for column in df.columns]).show()
Answered By: kites

Use either the na.fill() or fillna() function for this case.

  • If all your columns are string columns, df.na.fill('') will replace every null with '' across all columns.
  • For int columns, chain df.na.fill('').na.fill(0) to replace null with 0 (each fill only touches columns matching the value's type).
  • Another way is to pass a dict of columns and replacement values: df.fillna({'col1':'replacement_value',...,'col(n)':'replacement_value(n)'})

Example:

df.show()
#+-------------+----------+
#|UNIQUE_MEM_ID|      DATE|
#+-------------+----------+
#|         1156|      null|
#|         3787|      null|
#|         2631|2016-07-07|
#+-------------+----------+
from pyspark.sql.functions import *

df.na.fill('').show()
df.fillna({'DATE':''}).show()
#+-------------+----------+
#|UNIQUE_MEM_ID|      DATE|
#+-------------+----------+
#|         1156|          |
#|         3787|          |
#|         2631|2016-07-07|
#+-------------+----------+
Answered By: notNull