Remove any row with at least 1 NA with PySpark
Question:
I have a pyspark dataframe and I would like to remove any row containing at least one NA.
I know how to do so only for one column (code below).
How to do the same for all columns of the dataframe?
Reproducible example
# Import modules
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql.functions import col
from pyspark.sql import Row
# Defining SparkContext
SparkContext.getOrCreate()
# Defining SparkSession
spark = (SparkSession
         .builder
         .master("local")
         .appName("Introduction au DataFrame")
         .getOrCreate())
# Initiating DataFrame
values = [("1","2","3"),
("NA","1", "2"),
("4", "NA", "1")]
columns = ['var1',
'var2',
'var3']
df = spark.createDataFrame(values, columns)
# Initial dataframe
df.show()
+----+----+----+
|var1|var2|var3|
+----+----+----+
| 1| 2| 3|
| NA| 1| 2|
| 4| NA| 1|
+----+----+----+
# Subset rows without NAs (column 'var1')
df.where(~col('var1').contains('NA')).show()
+----+----+----+
|var1|var2|var3|
+----+----+----+
| 1| 2| 3|
| 4| NA| 1|
+----+----+----+
My expected output
+----+----+----+
|var1|var2|var3|
+----+----+----+
| 1| 2| 3|
+----+----+----+
What I also tried
I have tried the following, but it seems that PySpark does not recognize NA the way pandas does: it only recognizes actual null values.
from pyspark.sql.functions import count, isnan, when

df.na.drop().show()
df.select([count(when(isnan('var1'), True))]).show()
df.filter(df['var1'].isNotNull()).show()
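A quick check (a sketch, not one of the original attempts) makes the cause visible: the NA cells are plain strings, so nothing in the frame is actually null and the null-based tools above leave every row in place.
# All three columns are strings and no cell is null, so na.drop() removes nothing
df.printSchema()
print(df.na.drop().count())  # 3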
Answers:
Try this one:
df.dropna().show()
You can also pass the how parameter to the dropna method: with how = 'any' (the default), a row is removed if any of its columns is null, which is your case; with how = 'all', a row is removed only if all of its columns are null.
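A minimal sketch of the difference, on a hypothetical frame df2 holding real nulls and reusing the spark session from the question (remember that dropna ignores the string 'NA'):
# how='any' drops a row with any null; how='all' drops only all-null rows
df2 = spark.createDataFrame([("1", "2"), (None, "1"), (None, None)],
                            ["var1", "var2"])
df2.dropna(how='any').show()  # keeps only the ("1", "2") row
df2.dropna(how='all').show()  # drops only the (None, None) row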
# Replace the string 'NA' with a real null, then drop rows containing any null
new = df.na.replace({'NA': None}).dropna()
new.show()
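Alternatively, a sketch that filters on the string directly, with no conversion to nulls, by folding a per-column condition with functools.reduce (the name no_na is illustrative; col(c) != 'NA' plays the role of the contains check from the question):
from functools import reduce
from pyspark.sql.functions import col

# Keep only rows where no column holds the string 'NA'
no_na = df.where(reduce(lambda a, b: a & b,
                        [col(c) != 'NA' for c in df.columns]))
no_na.show()  # only the ('1', '2', '3') row survives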