In numpy, how to efficiently build a mapping from each unique value to its indices, without using a for loop

Question:

In numpy, how to efficiently build a mapping from each unique value to its indices, without using a for loop

I considered the following alternatives, but they are not efficient enough for my use case because I use large arrays.

The first alternative, requires traversing the array with for loop, which may be slow when considering large numpy arrays.

import numpy as np
from collections import defaultdict
a = np.array([1, 2, 6, 4, 2, 3, 2])
inv = defaultdict(list)
for i, x in enumerate(a):
    inv[x].append(i)

The second alternative is non-efficient because it requires travesing the array multiple times:

import numpy as np
a = np.array([1, 2, 6, 4, 2, 3, 2])
inv = {}
for x in np.unique(a):
    inv[x] = np.flatnonzero(a == x)

EDIT: My numpy array consists of integers and the usage is for image segmentation. I was also looking for a method in skimage, but did not find any.

Asked By: Yuval Atzmon

||

Answers:

I advise you to check out numba which can speed up numpy code on python significantly – it supports numpy.invert() and numpy.unique()documentation

Here is a good video explaining how to use numba from youtube – Make Python code 1000x Faster with Numba

Answered By: pugi

This should work.

a = np.array((1, 2, 6, 2, 4, 7, 25, 6))
fwd = np.argsort(a)
asorted = a[fwd]
keys = np.unique(asorted)
lower = np.searchsorted(asorted, keys)
# or higher = itertools.chain(lower[1:], (len(asorted),))
higher = np.append(lower[1:], len(asorted))
inv = {key: fwd[lower_i:higher_i]
       for key, lower_i, higher_i
       in zip(keys, lower, higher)}

assert all(np.all(a[indices] == key)
           for key, indices in inv.items())

It runs in something like O(n log(n)). The only loop that remains is the one to build a dictionary. That step is optional, of course.

From a purely algorithmic standpoint, your first approach (defaultdict(list)) would be better since it runs in aggregated linear time but of course the python overhead may be significant.

Answered By: Homer512

Maybe my answer comes a bit late, but if you are ok using pandas I think you can use groupby, which should run in linear time

import pandas as pd
a = np.array([1, 2, 6, 4, 2, 3, 2])
df = pd.DataFrame({'values': a})
inv = df.groupby(by='values').apply(lambda group: np.array(group.index))
inv = dict(inv)
Answered By: Lorenzo Guerini