Remove any row with at least 1 NA with PySpark


I have a PySpark DataFrame and I would like to remove any row containing at least one NA.
I know how to do so only for one column (code below).

How can I do the same for all columns of the DataFrame?

Reproducible example

# Import modules
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql.functions import col
from pyspark.sql import Row

# Defining SparkContext

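# Defining SparkSession
spark = SparkSession \
    .builder \
    .appName("Introduction au DataFrame") \
    .getOrCreate()

# Initiating DataFrame
values = [("1","2","3"), 
          ("NA","1", "2"), 
          ("4", "NA", "1")] 
# 'var2' and 'var3' are assumed names for the second and third columns
columns = ['var1', 'var2', 'var3']
df = spark.createDataFrame(values, columns)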

# Initial dataframe
|   1|   2|   3|
|  NA|   1|   2|
|   4|  NA|   1|

# Subset rows without NAs (column 'var1')
|   1|   2|   3|
|   4|  NA|   1|

My expected output

|   1|   2|   3|

What I also tried

I have tried the following, but it seems that PySpark doesn’t recognize NAs the way pandas does.
It only recognizes null values.

df.select([count(when(isnan('var1'), True))]).show()
Asked By: Yacine Hajji



try this one :


You can also specify the how parameter of the dropna method:
if how = 'any' (the default), a row is removed when at least one of its columns is null, which is your case;
if how = 'all', a row is removed only when all of its columns are null.

Answered By: lackti
new = (df.replace({'NA': None})  # Replace string NA with null
       .dropna())  # Drop rows that contain a null
Answered By: wwnde