Filter PySpark dataframe column with None value
Question:
I’m trying to filter a PySpark dataframe that has None as a row value:
df.select('dt_mvmt').distinct().collect()
[Row(dt_mvmt=u'2016-03-27'),
Row(dt_mvmt=u'2016-03-28'),
Row(dt_mvmt=u'2016-03-29'),
Row(dt_mvmt=None),
Row(dt_mvmt=u'2016-03-30'),
Row(dt_mvmt=u'2016-03-31')]
and I can filter correctly with a string value:
df[df.dt_mvmt == '2016-03-31']
# some results here
but this fails:
df[df.dt_mvmt == None].count()
0
df[df.dt_mvmt != None].count()
0
But there are definitely values in each category. What’s going on?
Answers:
You can use Column.isNull / Column.isNotNull:
from pyspark.sql.functions import col

df.where(col("dt_mvmt").isNull())
df.where(col("dt_mvmt").isNotNull())
If you want to simply drop NULL values, you can use na.drop with the subset argument:
df.na.drop(subset=["dt_mvmt"])
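A minimal end-to-end sketch, assuming an active SparkSession bound to the name spark and data mirroring the question’s dt_mvmt column:
from pyspark.sql.functions import col

# Hypothetical data mirroring the question's dt_mvmt column
data = [('2016-03-27',), ('2016-03-28',), ('2016-03-29',),
        (None,), ('2016-03-30',), ('2016-03-31',)]
df = spark.createDataFrame(data, ['dt_mvmt'])

df.where(col('dt_mvmt').isNull()).count()     # 1
df.where(col('dt_mvmt').isNotNull()).count()  # 5
df.na.drop(subset=['dt_mvmt']).count()        # 5, same rows as isNotNull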
Equality-based comparisons with NULL won’t work because in SQL NULL is undefined, so any attempt to compare it with another value returns NULL:
sqlContext.sql("SELECT NULL = NULL").show()
## +-------------+
## |(NULL = NULL)|
## +-------------+
## | null|
## +-------------+
sqlContext.sql("SELECT NULL != NULL").show()
## +-------------------+
## |(NOT (NULL = NULL))|
## +-------------------+
## | null|
## +-------------------+
The only valid way to compare a value with NULL is IS / IS NOT, which are equivalent to the isNull / isNotNull method calls.
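As an aside, if you genuinely need equality semantics under which NULL compares equal to NULL, Spark 2.3+ also provides the null-safe equality operator Column.eqNullSafe (the SQL <=> operator). A short sketch:
# Null-safe equality: NULL <=> NULL evaluates to true rather than NULL
df.where(df.dt_mvmt.eqNullSafe(None)).count()          # rows where dt_mvmt IS NULL
df.where(df.dt_mvmt.eqNullSafe('2016-03-31')).count()  # normal match; NULL rows excluded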
Try just using the isNotNull function:
df.filter(df.dt_mvmt.isNotNull()).count()
To obtain entries whose values in the dt_mvmt column are not null, we have
df.filter("dt_mvmt is not NULL")
and for entries which are null we have
df.filter("dt_mvmt is NULL")
PySpark provides various filtering options based on arithmetic, logical, and other conditions. The presence of NULL values can hamper further processing, so removing them or statistically imputing them may be appropriate.
The code below can be considered:
# Dataset is df
# Column name is dt_mvmt
# Before filtering make sure you have the right count of the dataset
df.count() # Some number
# Filter here
df = df.filter(df.dt_mvmt.isNotNull())
# Check the count to confirm the NULL values were removed (this is important when dealing with large datasets)
df.count() # Count should be reduced if NULL values were present
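If imputation is preferred over dropping, one simple option is na.fill; the sentinel date below is a made-up placeholder, not something from the question:
# Replace NULLs in dt_mvmt with a sentinel value instead of dropping the rows
df = df.na.fill({'dt_mvmt': '1900-01-01'})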
If you want to filter out records having a None value in a column, see the example below:
df = spark.createDataFrame([[123, "abc"], [234, "fre"], [345, None]], ["a", "b"])
Now filter out the null-value records:
df = df.filter(df.b.isNotNull())
df.show()
If you want to drop those records from the DataFrame using na.drop, see below:
df1 = df.na.drop(subset=['b'])
df1.show()
If the column contains the literal string 'None' rather than a true NULL, for example:
COLUMN_OLD_VALUE
----------------
None
1
None
100
20
----------------
then register a temp table on the data frame and filter on the string value (note this only matches rows holding the string 'None', not real NULLs):
df.registerTempTable("tempTable")
sqlContext.sql("select * from tempTable where column_old_value = 'None'").show()
So use: column_old_value = 'None'
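If those 'None' entries should really be NULLs, here is a sketch that converts the literal string into a true NULL (assuming the column is named column_old_value, as above) so that isNull() and na.drop behave as expected:
from pyspark.sql.functions import col, when

# Map the literal string 'None' to a real NULL; leave other values untouched
df = df.withColumn(
    'column_old_value',
    when(col('column_old_value') == 'None', None).otherwise(col('column_old_value'))
)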
If you want to keep to the Pandas syntax, this worked for me.
df = df[df.dt_mvmt.isNotNull()]
There are multiple ways to remove/filter the null values from a column in a DataFrame.
Let’s create a simple DataFrame with the code below:
from pyspark.sql.types import StringType

date = ['2016-03-27', '2016-03-28', '2016-03-29', None, '2016-03-30', '2016-03-31']
df = spark.createDataFrame(date, StringType())
Now you can try one of the approaches below to filter out the null values.
# Approach - 1
df.filter("value is not null").show()
# Approach - 2 (needs: from pyspark.sql.functions import col)
df.filter(col("value").isNotNull()).show()
# Approach - 3
df.filter(df["value"].isNotNull()).show()
# Approach - 4
df.filter(df.value.isNotNull()).show()
# Approach - 5
df.na.drop(subset=["value"]).show()
# Approach - 6
df.dropna(subset=["value"]).show()
# Note: You can also use the where function instead of filter.
You can also check the section “Working with NULL Values” on my blog for more information.
I hope it helps.
In PySpark/Python, None is an instance of the class NoneType, so the comparisons below will not work, as you are trying to compare a NoneType object with string objects.
The wrong way of filtering:
df[df.dt_mvmt == None].count()
0
df[df.dt_mvmt != None].count()
0
The correct way:
df = df.where(col("dt_mvmt").isNotNull())
returns all records whose dt_mvmt is not None/Null.
isNull() / isNotNull() return the respective rows whose dt_mvmt is Null or not Null:
method_1 = df.filter(df['dt_mvmt'].isNotNull()).count()
method_2 = df.filter(df.dt_mvmt.isNotNull()).count()
Both will return the same result.
In PySpark, you can use:
from pyspark.sql.functions import col
df.filter(col("dt_mvmt").isNull()).show()