Understanding weird boolean 2d-array indexing behavior in numpy
Question:
Why does this work:
a = np.random.rand(10, 20)
x_range = np.arange(10)
y_range = np.arange(20)
a_tmp = a[x_range<5,:]
b = a_tmp[:, np.in1d(y_range, [3,4,8])]
and this does not:
a = np.random.rand(10,20)
x_range = np.arange(10)
y_range = np.arange(20)
b = a[x_range<5, np.in1d(y_range,[3,4,8])]
Answers:
The NumPy reference documentation's page on indexing contains the answers, but requires a bit of careful reading.
The answer here is that indexing with booleans is equivalent to indexing with the integer arrays obtained by first transforming the boolean arrays with np.nonzero. Therefore, with boolean arrays m1, m2:
a[m1, m2] == a[m1.nonzero(), m2.nonzero()]
which (when it succeeds, i.e., when m1.nonzero() and m2.nonzero() yield index arrays of the same shape) pairs the nonzero indices element-wise, and so is equivalent to:
[a[i, j] for i, j in zip(m1.nonzero()[0], m2.nonzero()[0])]
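To see the pairing concretely, here is a small sketch using hypothetical masks chosen so each axis has the same number of True entries:

```python
import numpy as np

a = np.random.rand(10, 20)
m1 = np.arange(10) < 5                          # 5 True values on axis 0
m2 = np.in1d(np.arange(20), [3, 4, 8, 9, 15])   # 5 True values on axis 1

i = m1.nonzero()[0]   # integer indices of the True entries
j = m2.nonzero()[0]

# The k-th True index of m1 is paired with the k-th True index of m2,
# yielding a 1-D result rather than a 2-D sub-array.
assert np.array_equal(a[m1, m2], a[i, j])
assert np.array_equal(a[m1, m2], [a[ii, jj] for ii, jj in zip(i, j)])
```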
I’m not sure why it was designed to work like this — usually, this is not what you’d want.
To get the more intuitive result, you can instead do
a[np.ix_(m1, m2)]
which produces a result equivalent to
[[a[i,j] for j in range(a.shape[1]) if m2[j]] for i in range(a.shape[0]) if m1[i]]
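A quick sanity check of that equivalence, using small hypothetical masks (np.ix_ accepts boolean arrays directly and converts them to integer indices internally):

```python
import numpy as np

a = np.random.rand(4, 5)
m1 = np.array([True, False, True, False])        # mask for axis 0
m2 = np.array([False, True, True, False, True])  # mask for axis 1

# The nested comprehension keeps every (i, j) combination where
# both masks are True, producing a 2-D sub-array.
expected = [[a[i, j] for j in range(a.shape[1]) if m2[j]]
            for i in range(a.shape[0]) if m1[i]]

assert np.array_equal(a[np.ix_(m1, m2)], np.array(expected))
```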
An alternative to np.ix_ is to convert the boolean arrays to integer arrays (using np.nonzero()), and then use np.newaxis to create arrays of the right shape to take advantage of broadcasting.
import numpy as np

a = np.random.rand(10, 20)
x_range = np.arange(10)
y_range = np.arange(20)

# Reference result via double indexing.
a_tmp = a[x_range < 5, :]
b_correct = a_tmp[:, np.in1d(y_range, [3, 4, 8])]

# Convert the boolean masks to integer index arrays.
m1 = (x_range < 5).nonzero()[0]
m2 = np.in1d(y_range, [3, 4, 8]).nonzero()[0]

# m1[:, np.newaxis] has shape (5, 1) and m2 has shape (3,),
# so broadcasting selects the full (5, 3) sub-array.
b = a[m1[:, np.newaxis], m2]
assert np.allclose(b, b_correct)

# np.ix_ does the same reshaping for us, directly from the boolean masks.
b2 = a[np.ix_(x_range < 5, np.in1d(y_range, [3, 4, 8]))]
assert np.allclose(b2, b_correct)
np.ix_ tends to be slower than double indexing, and the long-form solution appears to be a bit faster:
long-form:
In [83]: %timeit a[(x_range<5).nonzero()[0][:,np.newaxis], (np.in1d(y_range,[3,4,8])).nonzero()[0]]
10000 loops, best of 3: 131 us per loop
double indexing:
In [85]: %timeit a[x_range<5,:][:,np.in1d(y_range,[3,4,8])]
10000 loops, best of 3: 144 us per loop
using np.ix_:
In [84]: %timeit a[np.ix_(x_range<5,np.in1d(y_range,[3,4,8]))]
10000 loops, best of 3: 160 us per loop
Note: It would be a good idea to test these timings on your machine since the rankings might change depending on your version of Python, numpy, or hardware.
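To rerun the comparison outside IPython, here is a minimal timeit sketch; the labels mirror the three variants above, and the absolute numbers will differ by machine and NumPy version:

```python
import timeit
import numpy as np

a = np.random.rand(10, 20)
x_range = np.arange(10)
y_range = np.arange(20)

candidates = {
    "long-form":       "a[(x_range < 5).nonzero()[0][:, np.newaxis], np.in1d(y_range, [3, 4, 8]).nonzero()[0]]",
    "double indexing": "a[x_range < 5, :][:, np.in1d(y_range, [3, 4, 8])]",
    "np.ix_":          "a[np.ix_(x_range < 5, np.in1d(y_range, [3, 4, 8]))]",
}

for label, stmt in candidates.items():
    t = timeit.timeit(stmt, globals=globals(), number=10000)
    print(f"{label}: {t / 10000 * 1e6:.1f} us per loop")
```

All three expressions select the same (5, 3) sub-array; only the indexing machinery differs.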
For anyone who still struggles to understand what is going on, Python for Data Analysis by Wes McKinney has a good explanation:
https://www.oreilly.com/library/view/python-for-data/9781449323592/ch04.html
chapter: "Fancy Indexing"
In short:
- Boolean indexes are converted to arrays of indexes using np.nonzero, as @pv. explained.
- We then have two "fancy" indexes, which select a 1-D array of elements, one element per corresponding pair of indexes.
>>> A=np.arange(0,9).reshape(3,-1)*10
>>> A
array([[ 0, 10, 20],
[30, 40, 50],
[60, 70, 80]])
>>> A[[1,2],[0,1]]
array([30, 70])
As you can see, it selected the values at indexes (1, 0) and (2, 1):
>>> [A[1,0], A[2,1]]
[30, 70]
Another way to achieve this is to select the per-axis indices you want separately, i.e.:
A[rows, :][:, cols]
Concrete example:
>>> A = np.arange(9).reshape(3, 3)
>>> A
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
# Slicing works as expected.
>>> A[1:, :2]
array([[3, 4],
[6, 7]])
# Indices that represent the same slice.
>>> cols = [0, 1]
>>> rows = [1, 2]
# Per OP, counterintuitively different.
>>> A[rows, cols]
array([3, 7])
# Workaround: Select axes separately.
>>> A[rows, :][:, cols]
array([[3, 4],
[6, 7]])
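The double-indexing workaround above also has a one-step equivalent with np.ix_, which works with integer index lists just as it does with boolean masks:

```python
import numpy as np

A = np.arange(9).reshape(3, 3)
rows = [1, 2]
cols = [0, 1]

# np.ix_ turns rows into shape (2, 1) and cols into shape (1, 2),
# so broadcasting selects the full 2x2 sub-block in one step.
sub = A[np.ix_(rows, cols)]
assert np.array_equal(sub, A[rows, :][:, cols])
```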