Efficient way to compute intersecting values between two numpy arrays
Question:
I have a bottleneck in my program which is caused by the following:
A = numpy.array([10,4,6,7,1,5,3,4,24,1,1,9,10,10,18])
B = numpy.array([1,4,5,6,7,8,9])
C = numpy.array([i for i in A if i in B])
The expected outcome for C
is the following:
C = [4 6 7 1 5 4 1 1 9]
Is there a more efficient way of doing this operation?
Note that array A
contains repeating values and they need to be taken into account. I wasn’t able to use set intersection since taking the intersection will omit the repeating values, returning just [1,4,5,6,7,9]
.
Also note this is only a simple demonstration. The actual array sizes can be in the order of thousands, to well over millions.
Answers:
Use numpy.in1d
:
>>> A[np.in1d(A, B)]
array([4, 6, 7, 1, 5, 4, 1, 1, 9])
You can use np.in1d
:
>>> A[np.in1d(A, B)]
array([4, 6, 7, 1, 5, 4, 1, 1, 9])
np.in1d
returns a boolean array indicating whether each value of A
also appears in B
. This array can then be used to index A
and return the common values.
It’s not relevant to your example, but it’s also worth mentioning that if A
and B
each contain unique values then np.in1d
can be sped up by setting assume_unique=True
:
np.in1d(A, B, assume_unique=True)
You might also be interested in np.intersect1d
which returns an array of the unique values common to both arrays (sorted by value):
>>> np.intersect1d(A, B)
array([1, 4, 5, 6, 7, 9])
If you check only for existence in B
(if i in B
) then obviously you can use a set
for this. It doesn’t matter how many fours there are in B
as long as there is at least one. Of course you are right, that you can’t use two sets and an intersection. But even one set
should improve performance, as searching complexity is less than O(n):
A = numpy.array([10,4,6,7,1,5,3,4,24,1,1,9,10,10,18])
B = set([1,4,5,6,7,8,9])
C = numpy.array([i for i in A if i in B])
We can use np.searchsorted
for performance boost, more so for the case when the lookup array has sorted unique values –
def intersect1d_searchsorted(A,B,assume_unique=False):
if assume_unique==0:
B_ar = np.unique(B)
else:
B_ar = B
idx = np.searchsorted(B_ar,A)
idx[idx==len(B_ar)] = 0
return A[B_ar[idx] == A]
That assume_unique
flag makes it work for both generic case and the special case of B
being unique and sorted.
Sample run –
In [89]: A = np.array([10,4,6,7,1,5,3,4,24,1,1,9,10,10,18])
...: B = np.array([1,4,5,6,7,8,9])
In [90]: intersect1d_searchsorted(A,B,assume_unique=True)
Out[90]: array([4, 6, 7, 1, 5, 4, 1, 1, 9])
Timings to compare against another vectorized np.in1d
based solution (listed in two other answers) on large arrays for both cases –
In [103]: A = np.random.randint(0,10000,(1000000))
In [104]: B = np.random.randint(0,10000,(1000000))
In [105]: %timeit A[np.in1d(A, B)]
...: %timeit A[np.in1d(A, B, assume_unique=False)]
...: %timeit intersect1d_searchsorted(A,B,assume_unique=False)
1 loop, best of 3: 197 ms per loop
10 loops, best of 3: 190 ms per loop
10 loops, best of 3: 151 ms per loop
In [106]: B = np.unique(np.random.randint(0,10000,(5000)))
In [107]: %timeit A[np.in1d(A, B)]
...: %timeit A[np.in1d(A, B, assume_unique=True)]
...: %timeit intersect1d_searchsorted(A,B,assume_unique=True)
10 loops, best of 3: 130 ms per loop
1 loop, best of 3: 218 ms per loop
10 loops, best of 3: 80.2 ms per loop
1-use the set intersection as it’s super fast in this case
c = set(a).intersection(b)
2-use the numpy intersect1d method which is faster than looping but slower than the first method
c = numpy.intersect1d(a,b)
I have a bottleneck in my program which is caused by the following:
A = numpy.array([10,4,6,7,1,5,3,4,24,1,1,9,10,10,18])
B = numpy.array([1,4,5,6,7,8,9])
C = numpy.array([i for i in A if i in B])
The expected outcome for C
is the following:
C = [4 6 7 1 5 4 1 1 9]
Is there a more efficient way of doing this operation?
Note that array A
contains repeating values and they need to be taken into account. I wasn’t able to use set intersection since taking the intersection will omit the repeating values, returning just [1,4,5,6,7,9]
.
Also note this is only a simple demonstration. The actual array sizes can be in the order of thousands, to well over millions.
Use numpy.in1d
:
>>> A[np.in1d(A, B)]
array([4, 6, 7, 1, 5, 4, 1, 1, 9])
You can use np.in1d
:
>>> A[np.in1d(A, B)]
array([4, 6, 7, 1, 5, 4, 1, 1, 9])
np.in1d
returns a boolean array indicating whether each value of A
also appears in B
. This array can then be used to index A
and return the common values.
It’s not relevant to your example, but it’s also worth mentioning that if A
and B
each contain unique values then np.in1d
can be sped up by setting assume_unique=True
:
np.in1d(A, B, assume_unique=True)
You might also be interested in np.intersect1d
which returns an array of the unique values common to both arrays (sorted by value):
>>> np.intersect1d(A, B)
array([1, 4, 5, 6, 7, 9])
If you check only for existence in B
(if i in B
) then obviously you can use a set
for this. It doesn’t matter how many fours there are in B
as long as there is at least one. Of course you are right, that you can’t use two sets and an intersection. But even one set
should improve performance, as searching complexity is less than O(n):
A = numpy.array([10,4,6,7,1,5,3,4,24,1,1,9,10,10,18])
B = set([1,4,5,6,7,8,9])
C = numpy.array([i for i in A if i in B])
We can use np.searchsorted
for performance boost, more so for the case when the lookup array has sorted unique values –
def intersect1d_searchsorted(A,B,assume_unique=False):
if assume_unique==0:
B_ar = np.unique(B)
else:
B_ar = B
idx = np.searchsorted(B_ar,A)
idx[idx==len(B_ar)] = 0
return A[B_ar[idx] == A]
That assume_unique
flag makes it work for both generic case and the special case of B
being unique and sorted.
Sample run –
In [89]: A = np.array([10,4,6,7,1,5,3,4,24,1,1,9,10,10,18])
...: B = np.array([1,4,5,6,7,8,9])
In [90]: intersect1d_searchsorted(A,B,assume_unique=True)
Out[90]: array([4, 6, 7, 1, 5, 4, 1, 1, 9])
Timings to compare against another vectorized np.in1d
based solution (listed in two other answers) on large arrays for both cases –
In [103]: A = np.random.randint(0,10000,(1000000))
In [104]: B = np.random.randint(0,10000,(1000000))
In [105]: %timeit A[np.in1d(A, B)]
...: %timeit A[np.in1d(A, B, assume_unique=False)]
...: %timeit intersect1d_searchsorted(A,B,assume_unique=False)
1 loop, best of 3: 197 ms per loop
10 loops, best of 3: 190 ms per loop
10 loops, best of 3: 151 ms per loop
In [106]: B = np.unique(np.random.randint(0,10000,(5000)))
In [107]: %timeit A[np.in1d(A, B)]
...: %timeit A[np.in1d(A, B, assume_unique=True)]
...: %timeit intersect1d_searchsorted(A,B,assume_unique=True)
10 loops, best of 3: 130 ms per loop
1 loop, best of 3: 218 ms per loop
10 loops, best of 3: 80.2 ms per loop
1-use the set intersection as it’s super fast in this case
c = set(a).intersection(b)
2-use the numpy intersect1d method which is faster than looping but slower than the first method
c = numpy.intersect1d(a,b)