numpy isin for multiple dimensions

Question:

I have a big array of integers and a second array of arrays. I want to create a boolean mask for the first array based on the data in each row of the second array. Preferably I would use numpy.isin, but its documentation clearly states the following about the test_elements argument:

The values against which to test each value of element. This argument is flattened if it is an array or array_like. See notes for behavior with non-array-like parameters.

Do you know a performant way of doing this without a list comprehension?
For example, given these arrays:

a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
b = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])

I would like to get a result like:

np.array([
       [True, True, False, False, False, False, False, False, False, False],
       [False, False, True, True, False, False, False, False, False, False],
       [False, False, False, False, True, True, False, False, False, False],
       [False, False, False, False, False, False, True, True, False, False],
       [False, False, False, False, False, False, False, False, True, True]
])
Asked By: Pawel


Answers:

Try numpy.apply_along_axis to work with numpy.isin:

np.apply_along_axis(lambda x: np.isin(a, x), axis=1, arr=b) 

returns

array([[[ True,  True, False, False, False, False, False, False, False, False]],
       [[False, False,  True,  True, False, False, False, False, False, False]],
       [[False, False, False, False,  True,  True, False, False, False, False]],
       [[False, False, False, False, False, False,  True,  True, False, False]],
       [[False, False, False, False, False, False, False, False,  True,  True]]])
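
Note that this result has shape (5, 1, 10) rather than the (5, 10) asked for in the question. np.isin returns a mask shaped like its first argument, and `a` here is two-dimensional with shape (1, 10). A minimal sketch, assuming the 2-D mask is what is wanted, is to flatten `a` before testing (the names `mask_3d` and `mask_2d` are only illustrative):

```python
import numpy as np

a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
b = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])

# np.isin returns a mask shaped like its first argument, so a (1, 10) input
# gives a (1, 10) mask per row of b and a (5, 1, 10) result overall.
mask_3d = np.apply_along_axis(lambda row: np.isin(a, row), axis=1, arr=b)

# Flattening a first gives one 1-D mask per row of b, i.e. shape (5, 10).
mask_2d = np.apply_along_axis(lambda row: np.isin(a.ravel(), row), axis=1, arr=b)

print(mask_3d.shape)
print(mask_2d.shape)
```

Dropping the extra axis afterwards, for example with `mask_3d[:, 0, :]`, should give the same 2-D mask.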

I will update with an edit comparing the runtime with a list comprehension.

EDIT:

Whelp, I tested the runtime, and wouldn't you know, the list comprehension is faster:

timeit.timeit("[np.isin(a,x) for x in b]",number=10000, globals=globals()) 
0.37380070000654086

vs

timeit.timeit("np.apply_along_axis(lambda x: np.isin(a, x), axis=1, arr=b) ",number=10000, globals=globals())
0.6078917000122601 

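The other answer to this post, by @mozway, came out much faster in my test (note that I ran it with number=100 rather than number=10000, so the totals are not directly comparable):

timeit.timeit("(a == b[...,None]).any(-2)",number=100, globals=globals())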
0.007107900004484691

and should probably be accepted.
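
Because the three measurements above use different `number` values, dividing each total by its `number` gives roughly 37 µs, 61 µs and 71 µs per call respectively. A like-for-like comparison needs the same repeat count for every candidate. The following is a minimal sketch of such a harness, not a definitive benchmark: the `candidates` dictionary, the `per_call` name and the repeat counts are assumptions made for illustration, and the timings will depend on the machine and on the array sizes.

```python
import timeit

import numpy as np

a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
b = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])

# The three approaches discussed in this thread, wrapped as callables.
candidates = {
    "list comprehension": lambda: [np.isin(a, x) for x in b],
    "apply_along_axis": lambda: np.apply_along_axis(
        lambda x: np.isin(a, x), axis=1, arr=b
    ),
    "broadcasting": lambda: (a == b[..., None]).any(-2),
}

number = 1000  # same repeat count for every candidate
per_call = {}
for name, func in candidates.items():
    # Best of three runs to reduce noise, converted to seconds per call.
    per_call[name] = min(timeit.repeat(func, number=number, repeat=3)) / number
    print(f"{name}: {per_call[name] * 1e6:.1f} µs per call")
```

With arrays this small, most of the time is call overhead, so it is worth repeating the measurement on realistically sized inputs before picking an approach.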

Answered By: David Kaftan

You can use broadcasting to avoid any loop (this uses more memory, however, because it builds an intermediate boolean array with b.size × a.size elements):

(a == b[...,None]).any(-2)

Output:

array([[ True,  True, False, False, False, False, False, False, False, False],
       [False, False,  True,  True, False, False, False, False, False, False],
       [False, False, False, False,  True,  True, False, False, False, False],
       [False, False, False, False, False, False,  True,  True, False, False],
       [False, False, False, False, False, False, False, False,  True,  True]])
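
To make the shapes and the memory cost concrete, here is a minimal sketch that spells out the intermediate array and then applies the same broadcast a few rows of `b` at a time. The helper `isin_rows_chunked` and its `chunk` parameter are hypothetical names introduced for illustration, not part of NumPy. For arrays the size of those benchmarked in the last answer (200000 values against 50 × 5000), the full intermediate would hold 5 × 10^10 booleans, about 50 GB, which is why chunking may be needed.

```python
import numpy as np

a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
b = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])

# b[..., None] has shape (5, 2, 1); comparing it with a of shape (1, 10)
# broadcasts to an intermediate boolean array of shape (5, 2, 10).
eq = a == b[..., None]
print(eq.shape, eq.nbytes)

# Reducing over axis -2 (the elements within each row of b) leaves (5, 10).
mask = eq.any(-2)


def isin_rows_chunked(a, b, chunk=2):
    """Hypothetical helper: run the broadcast on `chunk` rows of b at a time."""
    a = np.ravel(a)
    out = np.empty((b.shape[0], a.size), dtype=bool)
    for start in range(0, b.shape[0], chunk):
        block = b[start:start + chunk]  # shape (<=chunk, k)
        # The intermediate is now only (<=chunk, k, a.size) booleans.
        out[start:start + chunk] = (a == block[..., None]).any(-2)
    return out


# Both variants agree with the list comprehension from the question.
expected = np.array([np.isin(a.ravel(), row) for row in b])
assert np.array_equal(mask, expected)
assert np.array_equal(isin_rows_chunked(a, b), expected)
```

The chunk size trades peak memory against the number of Python-level iterations; chunk=1 is effectively a per-row loop.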
Answered By: mozway

This solution cheats a bit but is very fast. The cheat is that I sort each row of the second matrix beforehand so that I can use binary search.

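import time

import numba as nb
import numpy as np


@nb.njit(parallel=True)
def isin_multi(a, b):
    # a: 1-D array of values; b: 2-D array whose rows are sorted.
    # out[j, i] is True when a[i] occurs in row b[j].
    out = np.zeros((b.shape[0], a.shape[0]), dtype=nb.boolean)

    for i in nb.prange(a.shape[0]):
        for j in range(b.shape[0]):
            # Binary search for a[i] in the sorted row b[j].
            index = np.searchsorted(b[j], a[i])
            # Every row is checked, since a[i] may occur in several rows of b.
            if index < len(b[j]) and b[j][index] == a[i]:
                out[j, i] = True

    return out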

a = np.random.randint(200000, size=200000)
b = np.random.randint(200000, size=(50, 5000))

b = np.sort(b, axis=1)

start = time.perf_counter()
for _ in range(20):
    isin_multi(a, b)
print(f"isin_multi {time.perf_counter() - start:.3f} seconds")

start = time.perf_counter()
for _ in range(20):
    np.array([np.isin(a, ids) for ids in b])
print(f"comprehension {time.perf_counter() - start:.3f} seconds")

Results:

isin_multi 2.951 seconds
comprehension 21.093 seconds
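
The sorted-rows idea can also be checked without Numba. The following is a minimal NumPy-only sketch, assuming each row of `b` is non-empty and already sorted; `isin_rows_sorted` is a hypothetical helper name, and the small random arrays are chosen only so that the comparison against the list comprehension runs quickly.

```python
import numpy as np


def isin_rows_sorted(a, b_sorted):
    """Hypothetical helper: b_sorted must be sorted along axis 1."""
    a = np.ravel(a)
    out = np.zeros((b_sorted.shape[0], a.size), dtype=bool)
    for j, row in enumerate(b_sorted):
        # Insertion point of every value of a in this row, in [0, len(row)].
        idx = np.searchsorted(row, a)
        # Values larger than everything in the row get index len(row);
        # clamp them so the lookup below stays in bounds (they compare unequal).
        idx[idx == len(row)] = len(row) - 1
        out[j] = row[idx] == a
    return out


rng = np.random.default_rng(0)
a = rng.integers(0, 1000, size=500)
b = np.sort(rng.integers(0, 1000, size=(8, 40)), axis=1)

# Cross-check against the list comprehension used in the benchmark above.
expected = np.array([np.isin(a, row) for row in b])
assert np.array_equal(isin_rows_sorted(a, b), expected)
```

A cross-check like this is worth running on any optimized variant before comparing timings, since a value of `a` can appear in more than one row of `b`.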
Answered By: Pawel