NumPy selecting specific column index per row by using a list of indexes

Question:

I’m struggling to select the specific columns per row of a NumPy matrix.

Suppose I have the following matrix which I would call X:

[1, 2, 3]
[4, 5, 6]
[7, 8, 9]

I also have a list of column indexes per every row which I would call Y:

[1, 0, 2]

I need to get the values:

[2]
[4]
[9]

Instead of a list with indexes Y, I can also produce a matrix with the same shape as X where every column is a bool / int in the range 0-1 value, indicating whether this is the required column.

[0, 1, 0]
[1, 0, 0]
[0, 0, 1]

I know this can be done with iterating over the array and selecting the column values I need. However, this will be executed frequently on big arrays of data and that’s why it has to run as fast as it can.

I was thus wondering if there is a better solution?

Asked By: Zee

||

Answers:

If you’ve got a boolean array you can do direct selection based on that like so:

>>> a = np.array([True, True, True, False, False])
>>> b = np.array([1,2,3,4,5])
>>> b[a]
array([1, 2, 3])

To go along with your initial example you could do the following:

>>> a = np.array([[1,2,3], [4,5,6], [7,8,9]])
>>> b = np.array([[False,True,False],[True,False,False],[False,False,True]])
>>> a[b]
array([2, 4, 9])

You can also add in an arange and do direct selection on that, though depending on how you’re generating your boolean array and what your code looks like YMMV.

>>> a = np.array([[1,2,3], [4,5,6], [7,8,9]])
>>> a[np.arange(len(a)), [1,0,2]]
array([2, 4, 9])
Answered By: Slater Victoroff

You can do something like this:

In [7]: a = np.array([[1, 2, 3],
   ...: [4, 5, 6],
   ...: [7, 8, 9]])

In [8]: lst = [1, 0, 2]

In [9]: a[np.arange(len(a)), lst]
Out[9]: array([2, 4, 9])

More on indexing multi-dimensional arrays: http://docs.scipy.org/doc/numpy/user/basics.indexing.html#indexing-multi-dimensional-arrays

Answered By: Ashwini Chaudhary

You can do it by using iterator. Like this:

np.fromiter((row[index] for row, index in zip(X, Y)), dtype=int)

Time:

N = 1000
X = np.zeros(shape=(N, N))
Y = np.arange(N)

#@Aशwini चhaudhary
%timeit X[np.arange(len(X)), Y]
10000 loops, best of 3: 30.7 us per loop

#mine
%timeit np.fromiter((row[index] for row, index in zip(X, Y)), dtype=int)
1000 loops, best of 3: 1.15 ms per loop

#mine
%timeit np.diag(X.T[Y])
10 loops, best of 3: 20.8 ms per loop
Answered By: Kei Minagawa

A simple way might look like:

In [1]: a = np.array([[1, 2, 3],
   ...: [4, 5, 6],
   ...: [7, 8, 9]])

In [2]: y = [1, 0, 2]  #list of indices we want to select from matrix 'a'

range(a.shape[0]) will return array([0, 1, 2])

In [3]: a[range(a.shape[0]), y] #we're selecting y indices from every row
Out[3]: array([2, 4, 9])
Answered By: Dhaval Mayatra

Another clever way is to first transpose the array and index it thereafter. Finally, take the diagonal, its always the right answer.

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
Y = np.array([1, 0, 2, 2])

np.diag(X.T[Y])

Step by step:

Original arrays:

>>> X
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

>>> Y
array([1, 0, 2, 2])

Transpose to make it possible to index it right.

>>> X.T
array([[ 1,  4,  7, 10],
       [ 2,  5,  8, 11],
       [ 3,  6,  9, 12]])

Get rows in the Y order.

>>> X.T[Y]
array([[ 2,  5,  8, 11],
       [ 1,  4,  7, 10],
       [ 3,  6,  9, 12],
       [ 3,  6,  9, 12]])

The diagonal should now become clear.

>>> np.diag(X.T[Y])
array([ 2,  4,  9, 12]
Answered By: Thomas Devoogdt

Recent numpy versions have added a take_along_axis (and put_along_axis) that does this indexing cleanly.

In [101]: a = np.arange(1,10).reshape(3,3)                                                             
In [102]: b = np.array([1,0,2])                                                                        
In [103]: np.take_along_axis(a, b[:,None], axis=1)                                                     
Out[103]: 
array([[2],
       [4],
       [9]])

It operates in the same way as:

In [104]: a[np.arange(3), b]                                                                           
Out[104]: array([2, 4, 9])

but with different axis handling. It’s especially aimed at applying the results of argsort and argmax.

Answered By: hpaulj

The answer from hpaulj using take_along_axis should be the accepted one.

Here is a derived version with an N-dim index array:

>>> arr = np.arange(20).reshape((2,2,5))
>>> idx = np.array([[1,0],[2,4]])
>>> np.take_along_axis(arr, idx[...,None], axis=-1)
array([[[ 1],
        [ 5]],

       [[12],
        [19]]])

Note that the selection operation is ignorant about the shapes. I used this to refine a possibly vector-valued argmax result from histogram by fitting parabolas:

def interpol(arr):
    i = np.argmax(arr, axis=-1)
    a = lambda Δ: np.squeeze(np.take_along_axis(arr, i[...,None]+Δ, axis=-1), axis=-1)
    frac = .5*(a(1) - a(-1)) / (2*a(0) - a(-1) - a(1)) # |frac| < 0.5
    return i + frac

Note the squeeze to remove the dimension of size 1 resulting in the same shape of i and frac, the integer and fractional part of the peak position.

I’m quite sure that it is possible to avoid the lambda, but would the interpolation formula still look nice?

Answered By: Rainald62
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.