Detect if a NumPy array contains at least one non-numeric value?
Question:
I need to write a function which will detect if the input contains at least one value which is non-numeric. If a non-numeric value is found I will raise an error (because the calculation should only return a numeric value). The number of dimensions of the input array is not known in advance – the function should give the correct value regardless of ndim. As an extra complication the input could be a single float or numpy.float64 or even something oddball like a zero-dimensional array.
The obvious way to solve this is to write a recursive function which iterates over every iterable object in the array until it finds a non-iterable. It will apply the numpy.isnan() function to every non-iterable object. If at least one non-numeric value is found then the function will return False immediately. Otherwise, if all the values in the iterable are numeric, it will eventually return True.
That works just fine, but it’s pretty slow and I expect that NumPy has a much better way to do it. What is an alternative that is faster and more numpyish?
Here’s my mockup:
def contains_nan( myarray ):
    """
    @param myarray : An n-dimensional array or a single float
    @type myarray : numpy.ndarray, numpy.array, float
    @returns: bool
    Returns true if myarray is numeric or only contains numeric values.
    Returns false if at least one non-numeric value exists
    Not-A-Number is given by the numpy.isnan() function.
    """
    return True
Answers:
This should be faster than iterating and will work regardless of shape.
numpy.isnan(myarray).any()
Edit: 30x faster:
import timeit
s = 'import numpy;a = numpy.arange(10000.).reshape((100,100));a[10,10]=numpy.nan'
ms = [
    'numpy.isnan(a).any()',
    'any(numpy.isnan(x) for x in a.flatten())']
for m in ms:
    print(" %.2f s" % timeit.Timer(m, s).timeit(1000), m)
Results:
0.11 s numpy.isnan(a).any()
3.75 s any(numpy.isnan(x) for x in a.flatten())
Bonus: it works fine for non-array NumPy types:
>>> a = numpy.float64(42.)
>>> numpy.isnan(a).any()
False
>>> a = numpy.float64(numpy.nan)
>>> numpy.isnan(a).any()
True
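To tie this back to the question's mockup, here is a minimal sketch of a wrapper built on numpy.isnan(...).any(). It assumes the input is a float, a NumPy scalar, or a numeric array of any ndim (including zero-dimensional); the dtype check and the decision to return False for non-float dtypes are my assumptions, not part of the answer above, and object arrays are not handled.

```python
import numpy as np

def contains_nan(value):
    """Return True if value (scalar or array of any ndim) holds at least one NaN."""
    # np.asarray turns floats and NumPy scalars into 0-d arrays, so one code path
    # covers every shape.
    arr = np.asarray(value)
    # Assumption: only float ('f') and complex ('c') dtypes are checked for NaN.
    if arr.dtype.kind not in "fc":
        return False
    return bool(np.isnan(arr).any())
```

Note that this returns True when a NaN is present, which is the opposite of the docstring in the question's mockup; flip the result if you want the original convention.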
With NumPy 1.3 or later you can do this:
In [1]: a = arange(10000.).reshape(100,100)
In [3]: isnan(a.max())
Out[3]: False
In [4]: a[50,50] = nan
In [5]: isnan(a.max())
Out[5]: True
In [6]: timeit isnan(a.max())
10000 loops, best of 3: 66.3 µs per loop
The treatment of nans in comparisons was not consistent in earlier versions.
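As a small illustration of why the max() trick works, the sketch below shows that max() propagates a NaN while numpy.nanmax skips it. The array contents and variable names are made up for the example.

```python
import numpy as np

a = np.arange(12.0).reshape(3, 4)
a[1, 2] = np.nan

# max() returns NaN as soon as any element is NaN, so a single scalar
# isnan() check is enough and no boolean array of the same size is built.
has_nan_via_max = bool(np.isnan(a.max()))

# nanmax() ignores NaN, so it cannot be used for detection.
largest_ignoring_nan = np.nanmax(a)
```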
If infinity is a possible value, I would use numpy.isfinite
numpy.isfinite(myarray).all()
If the above evaluates to True, then myarray contains none of numpy.nan, numpy.inf or -numpy.inf.
numpy.isnan does not flag numpy.inf values (it returns False for them), for example:
In [11]: import numpy as np
In [12]: b = np.array([[4, np.inf],[np.nan, -np.inf]])
In [13]: np.isnan(b)
Out[13]:
array([[False, False],
       [ True, False]], dtype=bool)
In [14]: np.isfinite(b)
Out[14]:
array([[ True, False],
       [False, False]], dtype=bool)
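Since the question wants to raise an error on bad values, the isfinite check could be wrapped along these lines. This is only a sketch: the function name require_finite, the ValueError type, and the message are my assumptions.

```python
import numpy as np

def require_finite(result):
    """Return result unchanged, or raise ValueError if it holds NaN or +/-inf."""
    # Converting to a float array lets plain floats, NumPy scalars and
    # arrays of any ndim share one check.
    arr = np.asarray(result, dtype=float)
    if not np.isfinite(arr).all():
        raise ValueError("result contains NaN or infinity")
    return result
```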
(np.where(np.isnan(A)))[0].shape[0] will be greater than 0 if A contains at least one nan element; A could be an n x m matrix.
Example:
import numpy as np
A = np.array([1,2,4,np.nan])
if (np.where(np.isnan(A)))[0].shape[0]:
    print("A contains nan")
else:
    print("A does not contain nan")
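If the positions of the nan values matter and not just their presence, a variation on the same idea is to count and locate them. The sketch below uses numpy.count_nonzero and numpy.argwhere; the sample matrix is made up for illustration.

```python
import numpy as np

A = np.array([[1.0, np.nan],
              [4.0, 5.0],
              [np.nan, 7.0]])

mask = np.isnan(A)

# Number of nan elements anywhere in the matrix.
nan_count = int(np.count_nonzero(mask))

# One (row, column) pair per nan element.
nan_positions = np.argwhere(mask)
```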
Pfft! Microseconds!
Never solve a problem in microseconds that can be solved in nanoseconds.
Note that the accepted answer:
- iterates over the whole data, regardless of whether a nan is found
- creates a temporary array of size N, which is redundant.
A better solution is to return True immediately when NAN is found:
import numba
import numpy as np
NAN = float("nan")

@numba.njit(nogil=True)
def _any_nans(a):
    # Compiled loop: stop at the first NaN instead of scanning everything.
    for x in a:
        if np.isnan(x): return True
    return False

@numba.jit
def any_nans(a):
    # Only floating-point arrays can contain NaN.
    if not a.dtype.kind=='f': return False
    return _any_nans(a.flat)
array1M = np.random.rand(1000000)
assert any_nans(array1M)==False
%timeit any_nans(array1M) # 573us
array1M[0] = NAN
assert any_nans(array1M)==True
%timeit any_nans(array1M) # 774ns (!nanoseconds)
and works for n-dimensions:
array1M_nd = array1M.reshape((len(array1M)//2, 2))
assert any_nans(array1M_nd)==True
%timeit any_nans(array1M_nd) # 774ns
Compare this to the numpy native solution:
def any_nans(a):
    if not a.dtype.kind=='f': return False
    return np.isnan(a).any()
array1M = np.random.rand(1000000)
assert any_nans(array1M)==False
%timeit any_nans(array1M) # 456us
array1M[0] = NAN
assert any_nans(array1M)==True
%timeit any_nans(array1M) # 470us
%timeit np.isnan(array1M).any() # 532us
The early-exit method gives a speedup of 3 orders of magnitude (in some cases).
Not too shabby for a simple annotation.
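If numba is not available, a rough middle ground is to scan the array in chunks with plain NumPy and stop at the first chunk that contains a NaN. This is a sketch, not a benchmarked replacement: the name any_nans_chunked and the default chunk_size are my assumptions, and reshape(-1) may copy the data when the array is not contiguous.

```python
import numpy as np

def any_nans_chunked(a, chunk_size=65536):
    """Early-exit NaN check that only allocates chunk-sized temporaries."""
    a = np.asarray(a)
    # Only float and complex dtypes can hold NaN.
    if a.dtype.kind not in "fc":
        return False
    flat = a.reshape(-1)  # a view when possible, a copy otherwise
    for start in range(0, flat.size, chunk_size):
        # Each isnan() call builds a boolean array of at most chunk_size items.
        if np.isnan(flat[start:start + chunk_size]).any():
            return True
    return False
```

As with the numba version, the benefit depends on where the first NaN sits: an array with no NaN at all is still scanned in full.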