Gotchas where NumPy differs from straight Python?
Question:
Folks,
is there a collection of gotchas where NumPy differs from Python,
points that have puzzled and cost time?
“The horror of that moment I shall
never never forget !”
“You will, though,” the Queen said, “if you don’t
make a memorandum of it.”
For example, NaNs are always trouble, anywhere.
If you can explain this without running it, give yourself a point —
from numpy import array, NaN, isnan
pynan = float("nan")
print pynan is pynan, pynan is NaN, NaN is NaN
a = (0, pynan)
print a, a[1] is pynan, any([aa is pynan for aa in a])
a = array(( 0, NaN ))
print a, a[1] is NaN, isnan( a[1] )
(I’m not knocking numpy, lots of good work there, just think a FAQ or Wiki of gotchas would be useful.)
Edit: I was hoping to collect half a dozen gotchas (surprises for people learning Numpy).
Then, if there are common gotchas or, better, common explanations,
we could talk about adding them to a community Wiki (where ?)
It doesn’t look like we have enough so far.
Answers:
print pynan is pynan, pynan is NaN, NaN is NaN
This tests identity, that is if it is the same object. The result should therefore obviously be True, False, True, because when you do float(whatever) you are creating a new float object.
a = (0, pynan)
print a, a[1] is pynan, any([aa is pynan for aa in a])
I don’t know what it is that you find surprising with this.
a = array(( 0, NaN ))
print a, a[1] is NaN, isnan( a[1] )
This I did have to run. 🙂 When you stick NaN into an array it’s converted into a numpy.float64 object, which is why a[1] is NaN fails.
This all seems fairly unsurprising to me. But then I don’t really know anything much about NumPy. 🙂
NaN is not a singleton like None, so you can’t really use the is check on it. What makes it a bit tricky is that NaN == NaN is False, as IEEE-754 requires. That’s why you need to use the numpy.isnan() function to check if a float is not a number, or the standard library math.isnan() if you’re using Python 2.6+.
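To restate the quiz in modern spelling, a sketch assuming Python 3 and current NumPy, where `math.nan` and `np.nan` replace the old `NaN` alias:

```python
import math
import numpy as np

pynan = float("nan")
print(pynan is pynan, pynan is math.nan)  # True False: float("nan") builds a fresh object
print(pynan == pynan)                     # False: IEEE-754 says NaN != NaN
a = np.array([0.0, math.nan])
print(a[1] is math.nan)                   # False: stored as a fresh numpy.float64
print(math.isnan(pynan), np.isnan(a[1]))  # True True: the reliable checks
```

The identity checks behave exactly as the answers above explain: only `isnan()` is trustworthy.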
I think this one is funny:
>>> import numpy as n
>>> a = n.array([[1,2],[3,4]])
>>> a[1], a[0] = a[0], a[1]
>>> a
array([[1, 2],
[1, 2]])
For Python lists on the other hand this works as intended:
>>> b = [[1,2],[3,4]]
>>> b[1], b[0] = b[0], b[1]
>>> b
[[3, 4], [1, 2]]
Funny side note: numpy itself had a bug in the shuffle
function, because it used that notation 🙂 (see here).
The reason is that in the first case we are dealing with views of the array, so the values are overwritten in-place.
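Two ways to swap the rows safely, a sketch; both break the aliasing that foiled the tuple swap by copying before assigning:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
a[[0, 1]] = a[[1, 0]]   # fancy indexing on the right-hand side makes a copy first
print(a)                # [[3 4], [1 2]]

b = np.array([[1, 2], [3, 4]])
b[0], b[1] = b[1].copy(), b[0].copy()   # or copy the row views explicitly
print(b)                # [[3 4], [1 2]]
```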
The biggest gotcha for me was that almost every standard operator is overloaded to distribute across the array.
Define a list and an array
>>> l = range(10)
>>> l
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> import numpy
>>> a = numpy.array(l)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Multiplication duplicates the python list, but distributes over the numpy array
>>> l * 2
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> a * 2
array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])
Addition and division are not defined on python lists
>>> l + 2
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can only concatenate list (not "int") to list
>>> a + 2
array([ 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
>>> l / 2.0
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for /: 'list' and 'float'
>>> a / 2.0
array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])
Numpy overloads to treat lists like arrays sometimes
>>> a + a
array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])
>>> a + l
array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])
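The same contrast in a compact Python 3 sketch (where `range` is no longer a list, so it must be materialized first):

```python
import numpy as np

l = list(range(5))
a = np.arange(5)
print(l * 2)             # list replication: [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
print(a * 2)             # elementwise: [0 2 4 6 8]
print(l + l)             # list concatenation
print((a + l).tolist())  # the list is coerced to an array: [0, 2, 4, 6, 8]
```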
from Neil Martinsen-Burrell in numpy-discussion 7 Sept —
The ndarray type available in Numpy is
not conceptually an extension of
Python’s iterables. If you’d like to
help other Numpy users with this
issue, you can edit the documentation
in the online documentation editor at
numpy-docs
The truth value of a Numpy array differs from that of a python sequence type, where any non-empty sequence is true.
>>> import numpy as np
>>> l = [0,1,2,3]
>>> a = np.arange(4)
>>> if l: print "Im true"
...
Im true
>>> if a: print "Im true"
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: The truth value of an array with more than one element is ambiguous. Use
a.any() or a.all()
>>>
The numerical types are true when they are non-zero, and as a collection of numbers the numpy array inherits this definition. But with a collection of numbers, truth could reasonably mean “all elements are non-zero” or “at least one element is non-zero”. Numpy refuses to guess which definition is meant and raises the above exception. Using the .any() and .all() methods allows one to specify which meaning of true is meant.
>>> if a.any(): print "Im true"
...
Im true
>>> if a.all(): print "Im true"
...
>>>
Slicing creates views, not copies.
>>> l = [1, 2, 3, 4]
>>> s = l[2:3]
>>> s[0] = 5
>>> l
[1, 2, 3, 4]
>>> a = array([1, 2, 3, 4])
>>> s = a[2:3]
>>> s[0] = 5
>>> a
array([1, 2, 5, 4])
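If you want list-like semantics, an explicit `.copy()` detaches the slice from the parent array; a minimal sketch:

```python
import numpy as np

a = np.array([1, 2, 3, 4])
s = a[2:3].copy()   # copy instead of view
s[0] = 5
print(a)            # unchanged: [1 2 3 4]
print(s)            # [5]
```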
I found that replicating a list of arrays with * just creates multiple references to the same array, which caught me out.
>>> a = [0]*5
>>> a
[0, 0, 0, 0, 0]
>>> a[2] = 1
>>> a
[0, 0, 1, 0, 0]
>>> b = [np.ones(3)]*5
>>> b
[array([ 1.,  1.,  1.]), array([ 1.,  1.,  1.]), array([ 1.,  1.,  1.]), array([ 1.,  1.,  1.]), array([ 1.,  1.,  1.])]
>>> b[2][1] = 2
>>> b
[array([ 1.,  2.,  1.]), array([ 1.,  2.,  1.]), array([ 1.,  2.,  1.]), array([ 1.,  2.,  1.]), array([ 1.,  2.,  1.])]
So if you create a list of elements like this and intend to do different operations on them you are scuppered …
A straightforward solution is to iteratively create each of the arrays (using a ‘for loop’ or list comprehension) or use a higher dimensional array (where e.g. each of these 1D arrays is a row in your 2D array, which is generally faster).
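Both fixes can be sketched like this (the shapes are my choice for illustration):

```python
import numpy as np

# independent 1-D arrays via a comprehension: one fresh array per element
rows = [np.ones(3) for _ in range(5)]
rows[2][1] = 2.0
print(rows[0][1])   # 1.0 - the other rows are untouched

# or a single 2-D array, whose rows are independent by construction
b = np.ones((5, 3))
b[2, 1] = 2.0
print(b[0, 1])      # 1.0
```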
Because __eq__ does not return a bool, putting numpy arrays in any kind of container prevents equality testing without a container-specific workaround.
Example:
>>> import numpy
>>> a = numpy.array(range(3))
>>> b = numpy.array(range(3))
>>> a == b
array([ True, True, True], dtype=bool)
>>> x = (a, 'banana')
>>> y = (b, 'banana')
>>> x == y
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
This is a horrible problem. For example, you cannot write unittests for containers which use TestCase.assertEqual() and must instead write custom comparison functions. Suppose we write a work-around function special_eq_for_numpy_and_tuples. Now we can do this in a unittest:
x = (array1, 'deserialized')
y = (array2, 'deserialized')
self.failUnless( special_eq_for_numpy_and_tuples(x, y) )
Now we must do this for every container type we might use to store numpy arrays. Furthermore, __eq__ might return a bool rather than an array of bools:
>>> a = numpy.array(range(3))
>>> b = numpy.array(range(5))
>>> a == b
False
Now each of our container-specific equality comparison functions must also handle that special case.
Maybe we can patch over this wart with a subclass?
>>> class SaneEqualityArray (numpy.ndarray):
... def __eq__(self, other):
... return isinstance(other, SaneEqualityArray) and self.shape == other.shape and (numpy.ndarray.__eq__(self, other)).all()
...
>>> a = SaneEqualityArray( (2, 3) )
>>> a.fill(7)
>>> b = SaneEqualityArray( (2, 3) )
>>> b.fill(7)
>>> a == b
True
>>> x = (a, 'banana')
>>> y = (b, 'banana')
>>> x == y
True
>>> c = SaneEqualityArray( (7, 7) )
>>> c.fill(7)
>>> a == c
False
That seems to do the right thing. The class should also explicitly export elementwise comparison, since that is often useful.
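If all you need is a plain-bool comparison rather than a subclass, `numpy.array_equal` (and `numpy.testing.assert_array_equal` inside unittests) already handles mismatched shapes without raising:

```python
import numpy as np

a = np.arange(3)
b = np.arange(3)
c = np.arange(5)
print(np.array_equal(a, b))  # True
print(np.array_equal(a, c))  # False: different shapes, no exception
```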
(Related, but a NumPy vs. SciPy gotcha, rather than NumPy vs Python)
Slicing beyond an array’s real size works differently:
>>> import numpy, scipy.sparse
>>> m = numpy.random.rand(2, 5) # create a 2x5 dense matrix
>>> print m[:3, :] # works like list slicing in Python: clips to real size
[[ 0.12245393 0.20642799 0.98128601 0.06102106 0.74091038]
[ 0.0527411 0.9131837 0.6475907 0.27900378 0.22396443]]
>>> s = scipy.sparse.lil_matrix(m) # same for csr_matrix and other sparse formats
>>> print s[:3, :] # doesn't clip!
IndexError: row index out of bounds
So when slicing scipy.sparse arrays, you must manually make sure your slice bounds are within range. This differs from how both NumPy and plain Python work.
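One workaround is to clamp the bounds yourself before slicing; a small sketch (the helper name `clipped_rows` is mine, and it works on anything with a `.shape`):

```python
import numpy as np

def clipped_rows(mat, stop):
    # clamp the slice bound to the real row count, mimicking the
    # clipping that NumPy arrays and Python lists do automatically
    return mat[:min(stop, mat.shape[0]), :]

m = np.random.rand(2, 5)
print(clipped_rows(m, 3).shape)  # (2, 5): clipped, no IndexError
```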
Not such a big gotcha: With boolean slicing, I sometimes wish I could do
x[ 3 <= y < 7 ]
like the python double comparison. Instead, I have to write
x[ np.logical_and(3<=y, y<7) ]
(Unless you know something better?)
Also, np.logical_and and np.logical_or only take two arguments each, I would like them to take a variable number, or a list, so I could feed in more than just two logical clauses.
(numpy 1.3, maybe this has all changed in later versions.)
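Two common workarounds (the sample `x` and `y` here are my own): a parenthesized `&` for the double comparison, and `logical_and.reduce` for more than two clauses:

```python
import numpy as np

y = np.arange(10)
x = y * 10
print(x[(3 <= y) & (y < 7)])   # parentheses matter: & binds tighter than <=
mask = np.logical_and.reduce([3 <= y, y < 7, y != 5])  # any number of clauses
print(x[mask])                 # [30 40 60]
```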
In [1]: bool([])
Out[1]: False
In [2]: bool(array([]))
Out[2]: False
In [3]: bool([0])
Out[3]: True
In [4]: bool(array([0]))
Out[4]: False
So don’t test for the emptiness of an array by checking its truth value. Use its size instead. And don’t use len(), either:
In [1]: size(array([]))
Out[1]: 0
In [2]: len(array([]))
Out[2]: 0
In [3]: size(array([0]))
Out[3]: 1
In [4]: len(array([0]))
Out[4]: 1
In [5]: size(array(0))
Out[5]: 1
In [6]: len(array(0))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-6-5b2872696128> in <module>()
----> 1 len(array(0))
TypeError: len() of unsized object
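A sketch of an emptiness check that also survives 0-d arrays (the helper name is mine):

```python
import numpy as np

def is_empty(arr):
    # .size counts elements across all dimensions; a 0-d array has size 1
    return arr.size == 0

print(is_empty(np.array([])))   # True
print(is_empty(np.array([0])))  # False
print(is_empty(np.array(0)))    # False, where len() would raise TypeError
```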
No one seems to have mentioned this so far:
>>> all(False for i in range(3))
False
>>> from numpy import all
>>> all(False for i in range(3))
True
>>> any(False for i in range(3))
False
>>> from numpy import any
>>> any(False for i in range(3))
True
numpy’s any and all don’t play nicely with generators, and don’t raise any error warning you that they don’t.
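The fix is to materialize the generator first, or stick to the builtins; a minimal sketch:

```python
import numpy as np

# the builtins consume generators item by item
print(all(False for _ in range(3)))             # False
# numpy needs a concrete sequence, so build a list first
print(bool(np.all([False for _ in range(3)])))  # False, as intended
```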
A 0-d array of None looks like None but it is not the same:
In [1]: print None
None
In [2]: import numpy
In [3]: print numpy.array(None)
None
In [4]: numpy.array(None) is None
Out[4]: False
In [5]: numpy.array(None) == None
Out[5]: False
In [6]: print repr(numpy.array(None))
array(None, dtype=object)
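If you need the stored object back, `.item()` unwraps a 0-d object array; a short sketch:

```python
import numpy as np

x = np.array(None)
print(x is None)         # False: x is an ndarray wrapping None
print(x.item() is None)  # True: .item() extracts the underlying Python object
```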
A surprise with the *= assignment in combination with numpy.array:
>>> from numpy import array
>>> a = array([1, 2, 3])
>>> a *= 1.1
>>> print(a)
[1 2 3] # not quite what we expect or would like to see
>>> print(a.dtype)
int64 # and this is why
>>> a = 1.1 * a # here, a new array is created
>>> print(a, a.dtype)
[ 1.1 2.2 3.3] float64 # with the expected outcome
Surprising, annoying, but understandable. The *= operator will not change the dtype of the array, so multiplying an int array by a float in place cannot behave like conventional multiplication: the result is truncated back to integers. The Python version a = 1; a *= 1.1, on the other hand, works as expected.
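The fix is to give the array a floating-point dtype before the in-place multiply. (As an aside that I believe holds on newer NumPy releases, `a *= 1.1` on an int array now raises a casting error rather than truncating silently.)

```python
import numpy as np

a = np.array([1, 2, 3], dtype=float)   # float dtype up front
a *= 1.1
print(a)                               # [1.1 2.2 3.3]

b = np.array([1, 2, 3]).astype(float)  # or convert explicitly first
b *= 1.1
print(b)                               # [1.1 2.2 3.3]
```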