How to get indices of top-K values from a numpy array
Question:
Suppose I have class probabilities from a PyTorch or Keras model, produced with the softmax function:
import numpy as np
from scipy.special import softmax
probs = softmax(np.random.randn(20, 10), axis=1)  # 20 instances and 10 class probabilities
probs
I want to find the top-5 indices from this numpy array. All I want to do is loop over the results, something like:
for index in top_5_indices:
    if index in result:
        print('Found')
That way I can tell whether my results are among the top-5 predictions.
PyTorch has a topk function, and I have seen numpy.argpartition, but I have no idea how to use it for this.
Answers:
A little more expensive, but argsort would do:
idx = np.argsort(probs, axis=1)[:,-5:]
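Note that argsort sorts in ascending order, so the slice above lists each row's five indices from least to most probable. Below is a minimal sketch of reversing that order and running the membership check from the question. It assumes a hypothetical `result` array holding one class label per row (not part of the original question), and it computes softmax with plain numpy so the snippet depends on numpy only:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.standard_normal((20, 10))  # 20 instances, 10 classes

# Numerically stable softmax along the class axis (numpy-only stand-in
# for scipy.special.softmax).
e = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = e / e.sum(axis=1, keepdims=True)

# argsort is ascending: keep the last 5 columns, then reverse them so the
# indices run from most to least probable.
top5_idx = np.argsort(probs, axis=1)[:, -5:][:, ::-1]

# Hypothetical labels (an assumption for illustration): one class id per row.
result = rng.integers(0, 10, size=20)

# Row-wise membership test: is each row's label among its top-5 indices?
in_top5 = (top5_idx == result[:, None]).any(axis=1)
print("Rows whose label is in the top 5:", int(in_top5.sum()))
```

The comparison `top5_idx == result[:, None]` broadcasts each label across its row, which avoids an explicit Python loop.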
If we are talking about pytorch:
probs = torch.from_numpy(softmax(np.random.randn(20,10),1))
values, idx = torch.topk(probs, k=5, dim=-1)
The argpartition(a, k) function in numpy rearranges the indices of the input array a around the kth smallest element, so that all indices of smaller elements end up to its left and all indices of bigger elements end up to its right. Neither side is sorted. Not needing to sort all elements saves time: argpartition takes O(n) time, while argsort takes O(n log n) time.
So you can get the indices of the 5 biggest elements in each row like this:
np.argpartition(probs, -5, axis=1)[:, -5:]
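For a 2-D array such as probs, the partition has to run along the class axis and the slice has to be taken along that same axis; the selected indices then come back in arbitrary order. A small sketch of that, using random stand-in scores (an assumption for illustration) and sorting only the k selected entries afterwards:

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.random((20, 10))  # stand-in scores: 20 rows, 10 classes
k = 5

# Partition each row so that its k largest entries occupy the last k columns.
part_idx = np.argpartition(probs, -k, axis=1)[:, -k:]

# The k indices form the right set but are not ordered; sort just those k
# values (descending) instead of sorting the whole row.
part_vals = np.take_along_axis(probs, part_idx, axis=1)
order = np.argsort(-part_vals, axis=1)
topk_idx = np.take_along_axis(part_idx, order, axis=1)

print(topk_idx.shape)
```

Sorting only the k selected values costs O(k log k) per row, so the overall approach stays cheaper than a full argsort when k is much smaller than the row length.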
The existing answers are correct, but I wanted to expand on them to provide a self-contained function that behaves exactly like torch.topk with pure numpy.
Here’s the function (I’ve included the instructions inline):
def topk(array, k, axis=-1, sorted=True):
    # np.argpartition is faster than np.argsort, but it does not return the values in order.
    # We use ndarray.take because it lets us specify the axis.
    partitioned_ind = (
        np.argpartition(array, -k, axis=axis)
        .take(indices=range(-k, 0), axis=axis)
    )
    # Use the newly selected indices to look up the scores of the top-k values.
    partitioned_scores = np.take_along_axis(array, partitioned_ind, axis=axis)
    if sorted:
        # The top-k indices are not ordered yet, so we sort them with argsort
        # only if sorted=True (otherwise they stay in arbitrary order).
        sorted_trunc_ind = np.flip(
            np.argsort(partitioned_scores, axis=axis), axis=axis
        )
        # We use np.take_along_axis again, since we have an array of indices
        # that decides which values to select.
        ind = np.take_along_axis(partitioned_ind, sorted_trunc_ind, axis=axis)
        scores = np.take_along_axis(partitioned_scores, sorted_trunc_ind, axis=axis)
    else:
        ind = partitioned_ind
        scores = partitioned_scores
    return scores, ind
To verify the correctness, you can test it against torch:
import torch
import numpy as np
x = np.random.randn(50, 50, 10, 10)
axis = 2 # Change this to any axis and it'll be fine
val_np, ind_np = topk(x, k=10, axis=axis)
val_pt, ind_pt = torch.topk(torch.tensor(x), k=10, dim=axis)
print("Values are same:", np.all(val_np == val_pt.numpy()))
print("Indices are same:", np.all(ind_np == ind_pt.numpy()))
- To be clear, np.take_along_axis is recommended to be used with np.argpartition for accessing the original values in higher dimensions. np.argpartition is faster than np.argsort because it does not sort the entire array. This answer claims it takes O(n) instead of O(n log n).
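The speed difference mentioned above can be checked empirically. Here is a rough sketch using timeit; the array size and repeat count are arbitrary assumptions, the helper names are made up for illustration, and actual timings will depend on the machine and the NumPy version:

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 5000))  # array size is an arbitrary choice
k = 5

def top_k_argsort(a, k):
    # Full sort of every row, then keep the last k indices.
    return np.argsort(a, axis=1)[:, -k:]

def top_k_argpartition(a, k):
    # Partial selection: only guarantees the k largest land in the last k slots.
    return np.argpartition(a, -k, axis=1)[:, -k:]

t_sort = timeit.timeit(lambda: top_k_argsort(x, k), number=5)
t_part = timeit.timeit(lambda: top_k_argpartition(x, k), number=5)
print(f"argsort:      {t_sort:.4f} s")
print(f"argpartition: {t_part:.4f} s")

# Both approaches select the same set of indices per row (order may differ).
same = np.array_equal(
    np.sort(top_k_argsort(x, k), axis=1),
    np.sort(top_k_argpartition(x, k), axis=1),
)
print("Same index sets:", same)
```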