PySpark replace null in column with value in other column

Question:

I want to replace null values in one column with the values in an adjacent column. For example, if I have:

A|B
0,1
2,null
3,null
4,2

I want it to be:

A|B
0,1
2,2
3,3
4,2

I tried:

df.na.fill(df.A,"B")

But it didn't work; it says the value should be a float, int, long, string, or dict.

Any ideas?

Asked By: Luis Leal


Answers:

from pyspark.sql import Row
df.rdd.map(lambda row: row if row[1] is not None else Row(A=row[0], B=row[0])).toDF().show()  # "is not None" keeps a legitimate 0 in B
Answered By: Pushkr

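We can use coalesce, which returns the first non-null value among its arguments:

from pyspark.sql.functions import coalesce

# withColumn returns a new DataFrame, so assign the result.
df = df.withColumn("B", coalesce(df.B, df.A))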
Answered By: Luis Leal

Another answer.

Suppose df1 below is your DataFrame:

rd1 = sc.parallelize([(0,1), (2,None), (3,None), (4,2)])
df1 = rd1.toDF(['A', 'B'])

from pyspark.sql.functions import when

# Take A where B is null; otherwise keep B.
df1.select(
    'A',
    when(df1.B.isNull(), df1.A).otherwise(df1.B).alias('B')
).show()
Answered By: Rags

Note: coalesce will not replace NaN values, only nulls:

import pandas as pd
import pyspark.sql.functions as F

>>> cDf = spark.createDataFrame([(None, None), (1, None), (None, 2)], ("a", "b"))
>>> cDf.show()
+----+----+
|   a|   b|
+----+----+
|null|null|
|   1|null|
|null|   2|
+----+----+

>>> cDf.select(F.coalesce(cDf["a"], cDf["b"])).show()
+--------------+
|coalesce(a, b)|
+--------------+
|          null|
|             1|
|             2|
+--------------+

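Now let's create a pandas.DataFrame with None entries, convert it to a Spark DataFrame, and use coalesce again. pandas stores these None entries as NaN in a float64 column, so Spark receives NaN rather than null: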

>>> cDf_from_pd = spark.createDataFrame(pd.DataFrame({'a': [None, 1, None], 'b': [None, None, 2]}))
>>> cDf_from_pd.show()
+---+---+
|  a|  b|
+---+---+
|NaN|NaN|
|1.0|NaN|
|NaN|2.0|
+---+---+

>>> cDf_from_pd.select(F.coalesce(cDf_from_pd["a"], cDf_from_pd["b"])).show()
+--------------+
|coalesce(a, b)|
+--------------+
|           NaN|
|           1.0|
|           NaN|
+--------------+

In that case, you'll need to first call replace on your DataFrame to convert NaNs to nulls, for example df.replace(float("nan"), None), and then apply coalesce.
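
The null versus NaN distinction can be modelled in plain Python, where None stands in for a Spark null and float("nan") for NaN. This is only a rough sketch of the semantics, not PySpark code: the coalesce and nan_to_none functions below are hypothetical stand-ins written for illustration, and they do not call the Spark API.

```python
import math

def coalesce(*values):
    """Return the first argument that is not None (modelling a null), else None."""
    for value in values:
        if value is not None:
            return value
    return None

def nan_to_none(value):
    """Map a float NaN to None; leave every other value untouched."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value

nan = float("nan")

# None models a null: coalesce skips it and falls through to column b.
null_rows = [(None, None), (1, None), (None, 2)]
print([coalesce(a, b) for a, b in null_rows])   # [None, 1, 2]

# NaN is an ordinary float value, so coalesce returns it unchanged.
nan_rows = [(nan, nan), (1.0, nan), (nan, 2.0)]
print([coalesce(a, b) for a, b in nan_rows])    # [nan, 1.0, nan]

# Converting NaN to None first restores the expected fallback.
fixed = [coalesce(nan_to_none(a), nan_to_none(b)) for a, b in nan_rows]
print(fixed)                                    # [None, 1.0, 2.0]
```

The three printed lists line up with the two tables above plus the result you would expect after the NaN values have been turned into nulls.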

Answered By: Tomasz Bartkowiak