Efficiently checking if an arbitrary object is NaN in Python / numpy / pandas?
Question:
My numpy arrays use np.nan to designate missing values. As I iterate over the data set, I need to detect such missing values and handle them in special ways.
Naively I used numpy.isnan(val), which works well unless val isn’t among the subset of types supported by numpy.isnan(). For example, missing data can occur in string fields, in which case I get:
>>> np.isnan('some_string')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Not implemented for this type
Other than writing an expensive wrapper that catches the exception and returns False, is there a way to handle this elegantly and efficiently?
Answers:
Is your type really arbitrary? If you know it is just going to be an int, a float, or a string, you could just do
if val.dtype == float and np.isnan(val):
assuming it is wrapped in numpy, in which case it will always have a dtype, and only float and complex can be NaN.
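A minimal sketch of the dtype check described above, assuming each value is a scalar that NumPy can wrap. The helper name is_nan_scalar is hypothetical. It uses np.issubdtype rather than `val.dtype == float`, because the latter matches only float64 and would miss float32 and complex values.

```python
import numpy as np

def is_nan_scalar(val):
    """Return True only if val is a floating-point or complex NaN scalar."""
    # np.asarray gives plain Python objects (str, int, None) a dtype as well
    arr = np.asarray(val)
    if arr.ndim != 0:
        raise ValueError("expected a scalar value")
    # Only inexact dtypes (floating and complex) can hold NaN
    if np.issubdtype(arr.dtype, np.inexact):
        return bool(np.isnan(arr))
    return False
```

With this guard, is_nan_scalar('some_string') returns False instead of raising the TypeError shown in the question.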
pandas.isnull() (also pd.isna(), in newer versions) checks for missing values in both numeric and string/object arrays. From the documentation, it checks for:
NaN in numeric arrays, None/NaN in object arrays
Quick example:
import pandas as pd
import numpy as np
s = pd.Series(['apple', np.nan, 'banana'])
pd.isnull(s)
Out[9]:
0 False
1 True
2 False
dtype: bool
The idea of using numpy.nan to represent missing values is something that pandas introduced, which is why pandas has the tools to deal with it.
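Since the question iterates over individual values, it may help to note that pd.isna also accepts scalars. The following is a small sketch under that assumption; the sample values list is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical mix of values one might meet while iterating over a data set
values = ['apple', np.nan, None, 1.5, pd.NaT, np.float32('nan')]

# For a scalar argument, pd.isna returns a single boolean and does not
# raise on strings or other non-numeric objects
flags = [pd.isna(v) for v in values]

# Keep only the entries that are not missing
present = [v for v, missing in zip(values, flags) if not missing]
```

Here flags marks np.nan, None, pd.NaT and the float32 NaN as missing, and present holds 'apple' and 1.5.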
Datetimes too (if you use pd.NaT you won’t need to specify the dtype):
In [24]: s = pd.Series([pd.Timestamp('20130101'), np.nan, pd.Timestamp('20130102 9:30')], dtype='M8[ns]')
In [25]: s
Out[25]:
0 2013-01-01 00:00:00
1 NaT
2 2013-01-02 09:30:00
dtype: datetime64[ns]
In [26]: pd.isnull(s)
Out[26]:
0 False
1 True
2 False
dtype: bool
I found this brilliant solution here; it uses the simple logic that NaN != NaN.
https://www.codespeedy.com/check-if-a-given-string-is-nan-in-python/
Using the above example, you can simply do the following. This should work on different types of objects, as it simply utilizes the fact that NaN is not equal to NaN.
import numpy as np
import pandas as pd
s = pd.Series(['apple', np.nan, 'banana'])
s.apply(lambda x: x!=x)
Out[252]:
0 False
1 True
2 False
dtype: bool
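For plain Python scalars, the self-inequality trick can be sketched without pandas. Both helper names below are hypothetical. Two caveats apply: `x != x` detects NaN only, not None, and objects that override comparison (NumPy arrays, for example) return an element-wise result rather than a single bool.

```python
import math

def is_nan_by_inequality(x):
    """NaN is not equal to itself; strings, ints and None are."""
    return x != x

def is_nan(x):
    """Stricter variant: True only for float NaN values."""
    # The isinstance guard avoids the TypeError math.isnan raises on strings.
    # Note: np.float64 subclasses float, but np.float32 does not.
    return isinstance(x, float) and math.isnan(x)
```

The stricter variant always returns a plain bool, which makes it safer when the values may include array-like objects.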