Remove any row with at least 1 NA with PySpark

Question:

I have a PySpark DataFrame and I would like to remove any row containing at least one NA.
I only know how to do this for a single column (code below).

How can I do the same for all columns of the DataFrame?

Reproducible example

# Import modules
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql.functions import col
from pyspark.sql import Row

# Defining SparkContext
SparkContext.getOrCreate() 

# Defining SparkSession
spark = (SparkSession
    .builder
    .master("local")
    .appName("Introduction au DataFrame")
    .getOrCreate())

# Initiating DataFrame
values = [("1","2","3"), 
          ("NA","1", "2"), 
          ("4", "NA", "1")] 
columns = ['var1', 
           'var2', 
           'var3']
df = spark.createDataFrame(values, columns)

# Initial dataframe
df.show()
+----+----+----+
|var1|var2|var3|
+----+----+----+
|   1|   2|   3|
|  NA|   1|   2|
|   4|  NA|   1|
+----+----+----+

# Subset rows without NAs (column 'var1')
df.where(~col('var1').contains('NA')).show()
+----+----+----+
|var1|var2|var3|
+----+----+----+
|   1|   2|   3|
|   4|  NA|   1|
+----+----+----+

My expected output

+----+----+----+
|var1|var2|var3|
+----+----+----+
|   1|   2|   3|
+----+----+----+

What I also tried

I have tried the following but it seems that PySpark doesn’t recognize NAs as in pandas.
It only recognizes null values.

# Imports needed by the attempts below
from pyspark.sql.functions import count, when, isnan

df.na.drop().show()
df.select([count(when(isnan('var1'), True))]).show()
df.filter(df['var1'].isNotNull()).show()
Asked By: Yacine Hajji


Answers:

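Try this:

df.dropna().show()

You can also pass the how parameter to dropna:
how='any' (the default) removes a row if at least one of its columns is null, which is your case;
how='all' removes a row only if all of its columns are null.

Note that dropna only acts on null values (and NaN in numeric columns), so the string 'NA' has to be converted to null before it can be dropped this way.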

Answered By: lackti
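# Replace the string 'NA' with null, then drop every row that contains a null.
# show() returns None, so its result is not assigned to a variable.
(df.na.replace({'NA': None})
   .dropna()
).show()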
Answered By: wwnde