numpy isin for multi-dimmensions
Question:
I have a big array of integers and second array of arrays. I want to create a boolean mask for the first array based on data from the second array of arrays. Preferably I would use the numpy.isin
but it clearly states in it’s documentation:
The values against which to test each value of element. This argument is flattened if it is an array or array_like. See notes for behavior with non-array-like parameters.
Do you maybe know some performant way of doing this instead of list comprehension?
So for example having those arrays:
a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
b = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
I would like to have result like:
np.array([
[True, True, False, False, False, False, False, False, False, False],
[False, False, True, True, False, False, False, False, False, False],
[False, False, False, False, True, True, False, False, False, False],
[False, False, False, False, False, False, True, True, False, False],
[False, False, False, False, False, False, False, False, True, True]
])
Answers:
Try numpy.apply_along_axis
to work with numpy.isin
:
np.apply_along_axis(lambda x: np.isin(a, x), axis=1, arr=b)
returns
array([[[ True, True, False, False, False, False, False, False, False, False]],
[[False, False, True, True, False, False, False, False, False, False]],
[[False, False, False, False, True, True, False, False, False, False]],
[[False, False, False, False, False, False, True, True, False, False]],
[[False, False, False, False, False, False, False, False, True, True]]])
I will update with an edit comparing the runtime with a list comp
EDIT:
Whelp, I tested the runtime, and wouldn’t you know, listcomp is faster
timeit.timeit("[np.isin(a,x) for x in b]",number=10000, globals=globals())
0.37380070000654086
vs
timeit.timeit("np.apply_along_axis(lambda x: np.isin(a, x), axis=1, arr=b) ",number=10000, globals=globals())
0.6078917000122601
the other answer to this post by @mozway is much faster:
timeit.timeit("(a == b[...,None]).any(-2)",number=100, globals=globals())
0.007107900004484691
and should probably be accepted.
You can use broadcasting to avoid any loop (this is however more memory expensive):
(a == b[...,None]).any(-2)
Output:
array([[ True, True, False, False, False, False, False, False, False, False],
[False, False, True, True, False, False, False, False, False, False],
[False, False, False, False, True, True, False, False, False, False],
[False, False, False, False, False, False, True, True, False, False],
[False, False, False, False, False, False, False, False, True True]])
This is a bit cheated but ultra fast solution. The cheating is that I sort the seconds matrix before so that I can use binary search.
@nb.njit(parallel=True)
def isin_multi(a, b):
out = np.zeros((b.shape[0], a.shape[0]), dtype=nb.boolean)
for i in nb.prange(a.shape[0]):
for j in nb.prange(b.shape[0]):
index = np.searchsorted(b[j], a[i])
if index >= len(b[j]) or b[j][index] != a[i]:
out[j][i] = False
else:
out[j][i] = True
break
return out
a = np.random.randint(200000, size=200000)
b = np.random.randint(200000, size=(50, 5000))
b = np.sort(b, axis=1)
start = time.perf_counter()
for _ in range(20):
isin_multi(a, b)
print(f"isin_multi {time.perf_counter() - start:.3f} seconds")
start = time.perf_counter()
for _ in range(20):
np.array([np.isin(a, ids) for ids in b])
print(f"comprehension {time.perf_counter() - start:.3f} seconds")
Results:
isin_multi 2.951 seconds.
comprehension 21.093 seconds
I have a big array of integers and second array of arrays. I want to create a boolean mask for the first array based on data from the second array of arrays. Preferably I would use the numpy.isin
but it clearly states in it’s documentation:
The values against which to test each value of element. This argument is flattened if it is an array or array_like. See notes for behavior with non-array-like parameters.
Do you maybe know some performant way of doing this instead of list comprehension?
So for example having those arrays:
a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
b = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
I would like to have result like:
np.array([
[True, True, False, False, False, False, False, False, False, False],
[False, False, True, True, False, False, False, False, False, False],
[False, False, False, False, True, True, False, False, False, False],
[False, False, False, False, False, False, True, True, False, False],
[False, False, False, False, False, False, False, False, True, True]
])
Try numpy.apply_along_axis
to work with numpy.isin
:
np.apply_along_axis(lambda x: np.isin(a, x), axis=1, arr=b)
returns
array([[[ True, True, False, False, False, False, False, False, False, False]],
[[False, False, True, True, False, False, False, False, False, False]],
[[False, False, False, False, True, True, False, False, False, False]],
[[False, False, False, False, False, False, True, True, False, False]],
[[False, False, False, False, False, False, False, False, True, True]]])
I will update with an edit comparing the runtime with a list comp
EDIT:
Whelp, I tested the runtime, and wouldn’t you know, listcomp is faster
timeit.timeit("[np.isin(a,x) for x in b]",number=10000, globals=globals())
0.37380070000654086
vs
timeit.timeit("np.apply_along_axis(lambda x: np.isin(a, x), axis=1, arr=b) ",number=10000, globals=globals())
0.6078917000122601
the other answer to this post by @mozway is much faster:
timeit.timeit("(a == b[...,None]).any(-2)",number=100, globals=globals())
0.007107900004484691
and should probably be accepted.
You can use broadcasting to avoid any loop (this is however more memory expensive):
(a == b[...,None]).any(-2)
Output:
array([[ True, True, False, False, False, False, False, False, False, False],
[False, False, True, True, False, False, False, False, False, False],
[False, False, False, False, True, True, False, False, False, False],
[False, False, False, False, False, False, True, True, False, False],
[False, False, False, False, False, False, False, False, True True]])
This is a bit cheated but ultra fast solution. The cheating is that I sort the seconds matrix before so that I can use binary search.
@nb.njit(parallel=True)
def isin_multi(a, b):
out = np.zeros((b.shape[0], a.shape[0]), dtype=nb.boolean)
for i in nb.prange(a.shape[0]):
for j in nb.prange(b.shape[0]):
index = np.searchsorted(b[j], a[i])
if index >= len(b[j]) or b[j][index] != a[i]:
out[j][i] = False
else:
out[j][i] = True
break
return out
a = np.random.randint(200000, size=200000)
b = np.random.randint(200000, size=(50, 5000))
b = np.sort(b, axis=1)
start = time.perf_counter()
for _ in range(20):
isin_multi(a, b)
print(f"isin_multi {time.perf_counter() - start:.3f} seconds")
start = time.perf_counter()
for _ in range(20):
np.array([np.isin(a, ids) for ids in b])
print(f"comprehension {time.perf_counter() - start:.3f} seconds")
Results:
isin_multi 2.951 seconds.
comprehension 21.093 seconds