Replace null with empty string when writing Spark dataframe

Question:

Is there a way to replace null values in a column with an empty string when writing a Spark DataFrame to a file?

Sample data:

+----------------+------------------+
|   UNIQUE_MEM_ID|              DATE|
+----------------+------------------+
|            1156|              null|
|            3787|        2016-07-05|
|            1156|              null|
|            5064|              null|
|            5832|              null|
|            3787|              null|
|            5506|              null|
|            7538|              null|
|            7436|              null|
|            5091|              null|
|            8673|              null|
|            2631|              null|
|            8561|              null|
|            3516|              null|
|            1156|              null|
|            5832|              null|
|            2631|        2016-07-07|
Asked By: ben

Answers:

Check this out. You can use when() and otherwise():

    from pyspark.sql import functions as F

    df.show()

    #InputDF
    # +-------------+----------+
    # |UNIQUE_MEM_ID|      DATE|
    # +-------------+----------+
    # |         1156|      null|
    # |         3787|2016-07-05|
    # |         1156|      null|
    # +-------------+----------+


    df.withColumn("DATE", F.when(F.col("DATE").isNull(), '').otherwise(F.col("DATE"))).show()

    #OUTPUTDF
    # +-------------+----------+
    # |UNIQUE_MEM_ID|      DATE|
    # +-------------+----------+
    # |         1156|          |
    # |         3787|2016-07-05|
    # |         1156|          |
    # +-------------+----------+

To apply the above logic to all columns of the dataframe, iterate over df.columns and fill an empty string wherever a column's value is null:

    df.select(*[F.when(F.col(column).isNull(), '').otherwise(F.col(column)).alias(column) for column in df.columns]).show()
Answered By: kites

Use either the na.fill() or fillna() function for this case.

  • If all your columns are string columns, df.na.fill('') will replace every null with '' across all columns.
  • For int columns, chain df.na.fill('').na.fill(0) to replace null with 0 (each fill only touches columns matching the value's type).
  • Another way is to pass a dict of columns and replacement values: df.fillna({'col1':'replacement_value',...,'col(n)':'replacement_value(n)'})

Example:

df.show()
#+-------------+----------+
#|UNIQUE_MEM_ID|      DATE|
#+-------------+----------+
#|         1156|      null|
#|         3787|      null|
#|         2631|2016-07-07|
#+-------------+----------+
from pyspark.sql.functions import *

df.na.fill('').show()
df.fillna({'DATE':''}).show()
#+-------------+----------+
#|UNIQUE_MEM_ID|      DATE|
#+-------------+----------+
#|         1156|          |
#|         3787|          |
#|         2631|2016-07-07|
#+-------------+----------+
Answered By: notNull