Using integers as an index for a multidimensional NumPy array

Question:

I have a boolean array of shape (n_samples, n_items) that represents a set: my_set[i, j] tells whether sample i contains item j.

To populate it, the array is initialized with zeros, and I receive another array of integers, with shape (n_samples, 3), that gives, for each sample, three elements that belong to it. For instance:

my_set = np.zeros((2, 5), dtype=bool)
init_values = np.array([[1,3,4], [0,1,2]], dtype=np.int64)

So, I need to fill my_set with ones in row 0 at columns 1, 3, 4 and in row 1 at columns 0, 1, 2.

init_values contains valid values in the appropriate range (that is, in [0, n_items)), and each row doesn’t contain duplicate items.

Some failed approaches:

  1. I know that a list (or array) of integers can be used as an index, so I tried using init_values as the index directly, but it failed:
my_set[init_values] = 1
  File "<ipython-input-9-9b2c4d19f4f6>", line 1, in <cell line: 1>
    my_set[init_values] = 1
IndexError: index 3 is out of bounds for axis 0 with size 2
  2. I don’t know why the 3 is indexing the first axis, so I tried a second approach: "pick all rows and index only the desired columns", mixing slicing and integer indexing. It didn’t raise an error, but it didn’t work as expected either. Check the shape: I expected (2, 3), however…
my_set[:, init_values].shape
Out[11]: (2, 2, 3)
  3. I’m not sure why it didn’t work, but at least the first axis looks correct, so I tried picking only the first column, which is a plain list of integers and therefore "more natural"… once again, it didn’t work:
my_set[:, init_values[:,0]].shape
Out[12]: (2, 2)

I expected this shape to be (2, 1), since I wanted all rows with a single column each, corresponding to the indices given in init_values.

  4. I decided to go back to the integer index approach for the first axis… and it worked:
my_set[np.arange(len(my_set)), init_values[:,0]].shape
Out[13]: (2,)

However, it only works for one column, so I need to iterate over the columns to make it really work, but it looks like a good initial workaround.

Current solution

So, to solve my original problem, I wrote this:

for c in range(init_values.shape[1]):
    my_set[np.arange(len(my_set)), init_values[:,c]] = 1

# now let's check that my_set is properly filled
print(my_set)
Out[14]: [[False  True False  True  True]
          [ True  True  True False False]]

which is exactly what I need.

Question(s):

That said, here goes my main question:

Is there a more efficient way to do this? The loop looks quite inefficient as the number of elements per sample grows (for this example I used 3, but I actually need larger values).

In addition to this, I’d like to understand why using np.arange on the first index behaves differently from slicing it with a plain colon (:). I didn’t expect this behavior.

Any other comments that help explain why the previous approaches failed are also welcome.

Asked By: Rodrigo Laguna

||

Answers:

You only have column indices, so you also need to create their corresponding row indices:

>>> my_set[np.arange(len(my_set))[:, None], init_values] = 1
>>> my_set
array([[False,  True, False,  True,  True],
       [ True,  True,  True, False, False]])

[:, None] converts the 1-D array of row indices into a column vector of shape (n_samples, 1), so that the row and column indices have compatible shapes for broadcasting:

>>> np.arange(len(my_set))[:, None]
array([[0],
       [1]])
>>> np.broadcast_arrays(np.arange(len(my_set))[:, None], init_values)
[array([[0, 0, 0],
        [1, 1, 1]]),
 array([[1, 3, 4],
        [0, 1, 2]], dtype=int64)]
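
As a quick sanity check, the broadcast assignment can be compared against the column loop from the question. This is only a minimal sketch using the question's example data; the names looped, vectorized and rows are introduced here for illustration.

```python
import numpy as np

init_values = np.array([[1, 3, 4], [0, 1, 2]], dtype=np.int64)

# Loop version from the question: one indexed assignment per column.
looped = np.zeros((2, 5), dtype=bool)
for c in range(init_values.shape[1]):
    looped[np.arange(len(looped)), init_values[:, c]] = True

# Broadcast version: rows has shape (2, 1), init_values has shape (2, 3),
# so together they broadcast to a (2, 3) grid of (row, column) pairs.
vectorized = np.zeros((2, 5), dtype=bool)
rows = np.arange(len(vectorized))[:, None]
vectorized[rows, init_values] = True

print(np.broadcast(rows, init_values).shape)  # (2, 3)
print(np.array_equal(looped, vectorized))     # True
```

The single indexed assignment replaces the Python-level loop, so the number of items per sample no longer affects how many indexing calls are made.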

When a slice is combined with an index array, the index array is applied separately to every position covered by the slice. Here is a simple test. The matrix to be indexed is as follows:

>>> ar = np.arange(4).reshape(2, 2)
>>> ar
array([[0, 1],
       [2, 3]])

If you want to get the elements with indices 0 and 1 in row 0, and the elements with indices 1 and 0 in row 1, but you combine the column indices [[0, 1], [1, 0]] with a slice, you will get:

>>> ar[:, [[0, 1], [1, 0]]]
array([[[0, 1],
        [1, 0]],

       [[2, 3],
        [3, 2]]])

This is equivalent to combining each row index, 0 and 1, with the column indices in turn:

>>> ar[0, [[0, 1], [1, 0]]]
array([[0, 1],
       [1, 0]])
>>> ar[1, [[0, 1], [1, 0]]]
array([[2, 3],
       [3, 2]])
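
The equivalence above can also be checked programmatically. The sketch below stacks the per-row results and compares them with the sliced result; the names cols, sliced and stacked are illustrative only.

```python
import numpy as np

ar = np.arange(4).reshape(2, 2)
cols = np.array([[0, 1], [1, 0]])

# Slice on axis 0, index array on axis 1:
# result shape is (number of rows,) + cols.shape.
sliced = ar[:, cols]

# Apply the same column indices to each row separately, then stack.
stacked = np.stack([ar[r, cols] for r in range(ar.shape[0])])

print(sliced.shape)                     # (2, 2, 2)
print(np.array_equal(sliced, stacked))  # True
```

This also accounts for the (2, 2, 3) shape seen in the question: the slice keeps the 2 rows, and the (2, 3) index array contributes its full shape for every row.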

In fact, broadcasting is applied implicitly here. The actual indices are:

>>> np.broadcast_arrays(0, [[0, 1], [1, 0]])
[array([[0, 0],
        [0, 0]]),
 array([[0, 1],
        [1, 0]])]
>>> np.broadcast_arrays(1, [[0, 1], [1, 0]])
[array([[1, 1],
        [1, 1]]),
 array([[0, 1],
        [1, 0]])]

This is not the same as the indices you actually need. Therefore, you need to manually generate the correct row indices for broadcasting:

>>> ar[[[0], [1]], [[0, 1], [1, 0]]]
array([[0, 1],
       [3, 2]])
>>> np.broadcast_arrays([[0], [1]], [[0, 1], [1, 0]])
[array([[0, 0],
        [1, 1]]),
 array([[0, 1],
        [1, 0]])]
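
As a possible alternative, assuming NumPy 1.15 or later, np.take_along_axis and np.put_along_axis build the matching row indices internally. The sketch below reuses the arrays from above; the names cols and picked are illustrative only.

```python
import numpy as np

# Reading: pick, for each row, the columns listed in the same row of cols.
ar = np.arange(4).reshape(2, 2)
cols = np.array([[0, 1], [1, 0]])
picked = np.take_along_axis(ar, cols, axis=1)
print(picked)
# [[0 1]
#  [3 2]]

# Writing: set, for each row of my_set, the columns listed in init_values.
my_set = np.zeros((2, 5), dtype=bool)
init_values = np.array([[1, 3, 4], [0, 1, 2]], dtype=np.int64)
np.put_along_axis(my_set, init_values, True, axis=1)  # modifies my_set in place
print(my_set)
# [[False  True False  True  True]
#  [ True  True  True False False]]
```

Both functions require the index array to have the same number of dimensions as the array being indexed, which holds here since both are 2-D.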
Answered By: Mechanic Pig