Replace null with empty string when writing Spark dataframe
Question:
Is there a way to replace null values in a column with an empty string when writing a Spark dataframe to a file?
Sample data:
+----------------+------------------+
| UNIQUE_MEM_ID| DATE|
+----------------+------------------+
| 1156| null|
| 3787| 2016-07-05|
| 1156| null|
| 5064| null|
| 5832| null|
| 3787| null|
| 5506| null|
| 7538| null|
| 7436| null|
| 5091| null|
| 8673| null|
| 2631| null|
| 8561| null|
| 3516| null|
| 1156| null|
| 5832| null|
|            2631|        2016-07-07|
+----------------+------------------+
Answers:
Check this out. You can use when and otherwise:
from pyspark.sql import functions as F

df.show()
#InputDF
# +-------------+----------+
# |UNIQUE_MEM_ID| DATE|
# +-------------+----------+
# | 1156| null|
# | 3787|2016-07-05|
# | 1156| null|
# +-------------+----------+
df.withColumn("DATE", F.when(F.col("DATE").isNull(), '').otherwise(F.col("DATE"))).show()
#OUTPUTDF
# +-------------+----------+
# |UNIQUE_MEM_ID| DATE|
# +-------------+----------+
# | 1156| |
# | 3787|2016-07-05|
# | 1156| |
# +-------------+----------+
To apply the above logic to all columns of the dataframe, loop over the columns and fill an empty string wherever a value is null:
df.select(*[F.when(F.col(column).isNull(), '').otherwise(F.col(column)).alias(column) for column in df.columns]).show()
Use either the .na.fill() or fillna() function for this case.
- If you have all string columns, then df.na.fill('') will replace all nulls with '' on all columns.
- For int columns, df.na.fill('').na.fill(0) replaces nulls with 0.
- Another way is to create a dict of columns and replacement values: df.fillna({'col1':'replacement_value',...,'col(n)':'replacement_value(n)'})
Example:
df.show()
#+-------------+----------+
#|UNIQUE_MEM_ID| DATE|
#+-------------+----------+
#| 1156| null|
#| 3787| null|
#|         2631|2016-07-07|
#+-------------+----------+
from pyspark.sql.functions import *
df.na.fill('').show()
df.fillna({'DATE':''}).show()
#+-------------+----------+
#|UNIQUE_MEM_ID| DATE|
#+-------------+----------+
#| 1156| |
#| 3787| |
#|         2631|2016-07-07|
#+-------------+----------+