Why does Python copy NumPy arrays when the lengths of the dimensions are the same?

Question:

I have a problem with references to NumPy arrays.
I have a list of arrays of the form

import numpy as np
a = [np.array([0.0, 0.2, 0.4, 0.6, 0.8]),
     np.array([0.0, 0.2, 0.4, 0.6, 0.8]),
     np.array([0.0, 0.2, 0.4, 0.6, 0.8])]

If I now create a new variable,

b = np.array(a)

and do

b[0] += 1
print(a)

then a does not change:

a = [array([0. , 0.2, 0.4, 0.6, 0.8]),
     array([0. , 0.2, 0.4, 0.6, 0.8]),
     array([0. , 0.2, 0.4, 0.6, 0.8])]

But if I do the same thing with:

a = [np.array([0.0, 0.2, 0.4, 0.6, 0.8]),
     np.array([0.0, 0.2, 0.4, 0.6, 0.8]),
     np.array([0.0, 0.2, 0.4, 0.6])]

That is, I removed one number from the end of the last array. Then I do this again:

b = np.array(a)
b[0] += 1
print(a)

Now a does change, which is what I thought was the normal behavior in Python:

a = [array([1. , 1.2, 1.4, 1.6, 1.8]),
     array([0. , 0.2, 0.4, 0.6, 0.8]),
     array([0. , 0.2, 0.4, 0.6])]

Can anybody explain this to me?

Asked By: sholli

Answers:

In the first case, NumPy sees that the input to numpy.array can be interpreted as a 3×5, 2-dimensional array-like, so it does that. The result is a new array of float64 dtype, with the input data copied into it, independent of the input object. b[0] is a view of the new array’s first row, completely independent of a[0], and modifying b[0] does not affect a[0].

In the second case, since the lengths of the subarrays are unequal, the input cannot be interpreted as a 2-dimensional array-like. However, considering the subarrays as opaque objects, the list can be interpreted as a 1-dimensional array-like of objects, which is the interpretation NumPy falls back on. The result of the numpy.array call is a 1-dimensional array of object dtype, containing references to the array objects that were elements of the input list. b[0] is the same array object that a[0] is, and b[0] += 1 mutates that object.
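
To make the two cases concrete, here is a minimal sketch of both. The variable names are mine, not the asker's. One assumption to flag: recent NumPy releases (1.24 and later, if I remember the release notes correctly) raise a ValueError for ragged input unless dtype=object is given, so the sketch passes it explicitly. On the older versions this question was written against, plain np.array(a) fell back to an object array by itself.

```python
import numpy as np

# Case 1: equal lengths. np.array copies the numbers into a new 3x5 float array.
a_equal = [np.array([0.0, 0.2, 0.4, 0.6, 0.8]) for _ in range(3)]
b_equal = np.array(a_equal)
b_equal[0] += 1                      # modifies only b_equal's own buffer
print(a_equal[0])                    # the original array is untouched

# Case 2: unequal lengths. The result is a 1-D object array of references.
a_ragged = [np.array([0.0, 0.2, 0.4, 0.6, 0.8]),
            np.array([0.0, 0.2, 0.4, 0.6, 0.8]),
            np.array([0.0, 0.2, 0.4, 0.6])]
# dtype=object is an assumption needed on newer NumPy versions (see above).
b_ragged = np.array(a_ragged, dtype=object)
same_object = b_ragged[0] is a_ragged[0]   # the very same ndarray object
b_ragged[0] += 1                     # in-place add on that shared object
print(same_object, a_ragged[0])      # the list element has changed too
```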

This length dependence is one of the many reasons that trying to make jagged arrays or arrays of arrays is a really, really bad idea in NumPy. Seriously, don’t do it.

Answered By: user2357112

In a nutshell, this is a consequence of your data. You’ll notice that this works/does not work (depending on how you view it) because your arrays are not equally sized.

With equal-sized sub-arrays, the elements can be packed into a memory-efficient layout in which any N-D array is represented by a compact 1-D buffer. NumPy then handles the translation of multi-dimensional indexes to 1-D indexes internally. For example, index [i, j] of a 2-D array with N columns maps to i*N + j (if stored in row-major order). The data from the original list of arrays is copied into this compact 1-D buffer, so any modifications made to the new array do not affect the original.
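
The row-major mapping can be checked directly. This is only an illustrative sketch: the names grid, rows and cols are made up for the example, and it assumes the default C (row-major) order that np.arange(...).reshape(...) produces.

```python
import numpy as np

rows, cols = 3, 5
# A contiguous, row-major 3x5 array of float64 values 0.0 .. 14.0.
grid = np.arange(rows * cols, dtype=np.float64).reshape(rows, cols)
flat = grid.ravel()                  # 1-D view over the same buffer

i, j = 2, 3
flat_index = i * cols + j            # the i*N + j rule, with N = number of columns
print(flat_index, grid[i, j] == flat[flat_index])

# NumPy stores the same rule as byte strides: one row is cols * itemsize bytes.
print(grid.strides)                  # (40, 8) for float64

# ravel_multi_index performs the same translation for any shape.
print(np.ravel_multi_index((i, j), (rows, cols)))
```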

With ragged lists/arrays, this cannot be done. The array is effectively a Python list, where each element is a Python object. For efficiency, only the object references are copied and not the data. This is why you can mutate the original list elements in the second case but not the first.

Answered By: cs95

In [1]: a = [np.array([0.0, 0.2, 0.4, 0.6, 0.8]),
   ...:      np.array([0.0, 0.2, 0.4, 0.6, 0.8]),
   ...:      np.array([0.0, 0.2, 0.4, 0.6, 0.8])]
In [2]: a
Out[2]: 
[array([0. , 0.2, 0.4, 0.6, 0.8]),
 array([0. , 0.2, 0.4, 0.6, 0.8]),
 array([0. , 0.2, 0.4, 0.6, 0.8])]

a is a list of arrays. b is a 2d array.

In [3]: b = np.array(a)                                                         
In [4]: b                                                                       
Out[4]: 
array([[0. , 0.2, 0.4, 0.6, 0.8],
       [0. , 0.2, 0.4, 0.6, 0.8],
       [0. , 0.2, 0.4, 0.6, 0.8]])
In [5]: b[0] += 1                                                               
In [6]: b                                                                       
Out[6]: 
array([[1. , 1.2, 1.4, 1.6, 1.8],
       [0. , 0.2, 0.4, 0.6, 0.8],
       [0. , 0.2, 0.4, 0.6, 0.8]])

b gets values from a but does not contain any of the a objects. The underlying data structure of this b is very different from a, the list. If that isn’t clear, you may want to review the numpy basics (which talk about shape, strides, and data buffers).

In the second case, b is an object array, containing the same objects as a:

In [8]: b = np.array(a)                                                         
In [9]: b                                                                       
Out[9]: 
array([array([0. , 0.2, 0.4, 0.6, 0.8]), array([0. , 0.2, 0.4, 0.6, 0.8]),
       array([0. , 0.2, 0.4, 0.6])], dtype=object)

This b behaves a lot like a: both contain arrays.

The construction of this object array is quite different from the 2d numeric array. I think of the numeric array as the default, or normal, numpy behavior, while the object array is a ‘concession’, giving us a useful tool, but one which does not have the calculation power of the multidimensional array.

It is easy to make an object array by mistake – some say too easy. It can be harder to make one reliably by design. For example, with the original a, we have to do:

In [17]: b = np.empty(3, object)                                                
In [18]: b[:] = a[:]                                                            
In [19]: b                                                                      
Out[19]: 
array([array([0. , 0.2, 0.4, 0.6, 0.8]), array([0. , 0.2, 0.4, 0.6, 0.8]),
       array([0. , 0.2, 0.4, 0.6, 0.8])], dtype=object)

or even for i in range(3): b[i] = a[i]
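
As a rough sketch of why the extra step is needed: with equal-length sub-arrays, asking for dtype=object alone still gives a 2-D array, so pre-allocating and filling element by element is the dependable route. The helper name as_object_array is hypothetical, not a NumPy function.

```python
import numpy as np

a = [np.array([0.0, 0.2, 0.4, 0.6, 0.8]) for _ in range(3)]

# dtype=object does not stop NumPy from descending into equal-length rows:
two_d = np.array(a, dtype=object)
print(two_d.shape)                   # (3, 5), a 2-D object array

def as_object_array(items):
    """Hypothetical helper: a 1-D object array with one slot per item."""
    out = np.empty(len(items), dtype=object)
    for k, item in enumerate(items):
        out[k] = item                # stores a reference, not a copy
    return out

one_d = as_object_array(a)
print(one_d.shape, one_d[0] is a[0])  # (3,) True
```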

Answered By: hpaulj

When you call np.array on arrays of consistent length, a new np.ndarray of floats is created.

Thus, your a[0] and b[0] do not share the same reference.

a = [np.array([0.0, 0.2, 0.4, 0.6, 0.8]),
     np.array([0.0, 0.2, 0.4, 0.6, 0.8]),
     np.array([0.0, 0.2, 0.4, 0.6, 0.8])]
b = np.array(a)
id(a[0])
# 139663994327728
id(b[0])
# 139663994324672

However, with arrays of varying lengths, np.array creates an np.ndarray whose elements are objects.

a2 = [np.array([0. , 0.2, 0.4, 0.6, 0.8]), 
     np.array([0. , 0.2, 0.4, 0.6, 0.8]), 
     np.array([0. , 0.2, 0.4, 0.6])]
b2 = np.array(a2)
b2
array([array([0. , 0.2, 0.4, 0.6, 0.8]), array([0. , 0.2, 0.4, 0.6, 0.8]),
       array([0. , 0.2, 0.4, 0.6])], dtype=object)

Here b2 still keeps the same references as a2:

for s in a2:
    print(id(s))
# 139663994330128
# 139663994328448
# 139663994329488

for s in b2:
    print(id(s))
# 139663994330128
# 139663994328448
# 139663994329488

This is why adding to b2[0] also adds to a2[0].
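
One caveat on comparing id values: indexing a numeric array builds a fresh view object on every access, so id(b[0]) differs from id(a[0]) regardless of whether data was copied. A check on the underlying buffers is more direct. The sketch below uses np.shares_memory for that, and passes dtype=object explicitly on the assumption that a newer NumPy version is in use, where ragged input requires it.

```python
import numpy as np

a = [np.array([0.0, 0.2, 0.4, 0.6, 0.8]) for _ in range(3)]
b = np.array(a)

fresh_views = b[0] is not b[0]       # each b[0] is a new view object
copied = not np.shares_memory(a[0], b[0])
print(fresh_views, copied)           # True True

a2 = [np.array([0.0, 0.2, 0.4, 0.6, 0.8]),
      np.array([0.0, 0.2, 0.4, 0.6, 0.8]),
      np.array([0.0, 0.2, 0.4, 0.6])]
b2 = np.array(a2, dtype=object)      # dtype=object: assumption for newer NumPy

# Every slot of b2 refers to the same buffer as the matching list element.
shared = all(np.shares_memory(x, y) for x, y in zip(a2, b2))
print(shared)                        # True
```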

Answered By: Chris

The primary use case for which numpy.array() was designed is to create an n-dimensional array of numbers, where the numbers are all stored in NumPy’s own efficiently designed internal structure.

Whenever it is possible to do this, numpy.array() will indeed do it.

(The efficiency of this internal structure would be your primary reason for using numpy ndarrays rather than Python lists, so the fact that the numbers are being copied should actually be a desirable/good thing for you)

When your a is a list of 3 ndarrays, each of size 5, it is clearly possible for numpy.array() to create an n-dimensional ndarray of numbers (specifically a 2-dimensional one, with shape (3,5)).

So, any change to b[0] is actually a change to this internal data structure of numbers, which were all copied over from a.

When your a is a list of unequally sized ndarrays, it is no longer possible for numpy.array() to convert this into an n-dimensional array of shape (3,5).

So, the function does the next best thing it can do, which is, to treat each of the 3 ndarrays as an object, and return a 1-dimensional ndarray of those objects. The length of this returned ndarray is 3 (the number of objects). You can see this by printing b.shape (will print (3,) instead of (3,5)) and b.dtype (will print object instead of float64).
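
A quick way to see which of the two results you got is to print shape and dtype, as described above. This is a small sketch with placeholder data (np.zeros instead of the values from the question). The dtype=object argument is an assumption for newer NumPy versions, which no longer fall back to an object array on their own.

```python
import numpy as np

equal = np.array([np.zeros(5), np.zeros(5), np.zeros(5)])
ragged = np.array([np.zeros(5), np.zeros(5), np.zeros(4)], dtype=object)

print(equal.shape, equal.dtype)      # (3, 5) float64
print(ragged.shape, ragged.dtype)    # (3,) object
```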

In this case, numpy.array() does not dive deeper into each of your 3 ndarrays to copy the numbers of those 3 ndarrays, since it is not going to create its own efficiently designed n-dimensional array of numbers — it is only going to return a 1-dimensional array of objects.

So, any change you make to b[0] can be also seen through a, since both a and b hold references to the same objects (the 3 ndarrays of unequal sizes).

Answered By: fountainhead

@coldspeed correctly explained why you see the difference in behavior. I just wanted to point out that copying is expected.

In the documentation you can see that the function has a copy flag that is set to True by default:

numpy.array(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)

If a copy should only be done if necessary, use np.asarray instead.

In your example that does not really make a difference, because a is a list rather than a NumPy array, so it will always be copied.

If a were an array, the behavior would be as follows:

import numpy as np
a = np.array([[0.0, 0.2, 0.4, 0.6, 0.8],
              [0.0, 0.2, 0.4, 0.6, 0.8],
              [0.0, 0.2, 0.4, 0.6, 0.8]])
b = np.array(a)
b[0] += 1
a

Out[6]: 
array([[0. , 0.2, 0.4, 0.6, 0.8],
       [0. , 0.2, 0.4, 0.6, 0.8],
       [0. , 0.2, 0.4, 0.6, 0.8]])

c = np.asarray(a)
c[0] += 1
a

Out[9]: 
array([[1. , 1.2, 1.4, 1.6, 1.8],
       [0. , 0.2, 0.4, 0.6, 0.8],
       [0. , 0.2, 0.4, 0.6, 0.8]])
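
The same difference can be checked without mutating anything. A minimal sketch, using a smaller array than above and names of my own choosing:

```python
import numpy as np

a = np.array([[0.0, 0.2, 0.4],
              [0.0, 0.2, 0.4]])

copy_made = np.array(a)              # copy=True by default: new buffer
no_copy = np.asarray(a)              # already a suitable ndarray: returned as-is

print(copy_made is a, np.shares_memory(copy_made, a))   # False False
print(no_copy is a)                                     # True

# asarray still has to copy when a conversion is needed, e.g. a dtype change
# or a plain list as input (a list has no NumPy buffer to reuse).
converted = np.asarray(a, dtype=np.float32)
from_list = np.asarray([[0.0, 0.2], [0.4, 0.6]])
print(np.shares_memory(converted, a))                   # False
```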

Answered By: Tim