check if numpy array is subset of another array

Question:

Similar questions have already been asked on SO, but they have more specific constraints and their answers don’t apply to my question.

Generally speaking, what is the most pythonic way to determine if an arbitrary numpy array is a subset of another array? More specifically, I have a roughly 20000×3 array and I need to know the indices of the 1×3 elements that are entirely contained within a set. More generally, is there a more pythonic way of writing the following:

import numpy as np

master = [12, 155, 179, 234, 670, 981, 1054, 1209, 1526, 1667, 1853]  # some indices of interest
triangles = np.random.randint(2000, size=(20000, 3))  # some data

for i, x in enumerate(triangles):
    if x[0] in master and x[1] in master and x[2] in master:
        print(i)

For my use case, I can safely assume that len(master) << 20000. (Consequently, it is also safe to assume that master is sorted because this is cheap).

Asked By: aestrivex


Answers:

You can do this easily by iterating over the array in a list comprehension. A toy example is as follows:

import numpy as np
x = np.arange(30).reshape(10,3)
searchKey = [4,5,8]
x[[0,3,7],:] = searchKey
x

gives

 array([[ 4,  5,  8],
        [ 3,  4,  5],
        [ 6,  7,  8],
        [ 4,  5,  8],
        [12, 13, 14],
        [15, 16, 17],
        [18, 19, 20],
        [ 4,  5,  8],
        [24, 25, 26],
        [27, 28, 29]])

Now iterate over the elements:

ismember = [row==searchKey for row in x.tolist()]

The result is

[True, False, False, True, False, False, False, True, False, False]

To test for a subset instead, as in your question, compare each row against a set:

searchKey = [2,4,10,5,8,9]  # Add more elements for testing
setSearchKey = set(searchKey)
ismember = [setSearchKey.issuperset(row) for row in x.tolist()]

If you need the indices, then use

np.where(ismember)[0]

It gives

array([0, 3, 7])
Answered By: petrichor

Here are two approaches you could try:

1. Use sets. Sets are implemented much like Python dictionaries and have constant-time lookups. That would look much like the code you already have; just create a set from master:

import numpy as np

master = [12, 155, 179, 234, 670, 981, 1054, 1209, 1526, 1667, 1853]
master_set = set(master)
triangles = np.random.randint(2000, size=(20000, 3))  # some data
for i, x in enumerate(triangles):
    if master_set.issuperset(x):
        print(i)

2. Use searchsorted. This is nice because it doesn’t require hashable types and uses numpy builtins. searchsorted is O(log N) in the size of master and O(N) in the size of triangles, so it should also be pretty fast, maybe faster, depending on the size of your arrays.

import numpy as np

master = [12, 155, 179, 234, 670, 981, 1054, 1209, 1526, 1667, 1853]
master = np.asarray(master)
triangles = np.random.randint(2000, size=(20000, 3))  # some data
idx = master.searchsorted(triangles)
# Values larger than master[-1] get index len(master); clip them into range
idx.clip(max=len(master) - 1, out=idx)
print(np.where(np.all(triangles == master[idx], axis=1)))

This second case assumes master is sorted, as searchsorted implies.
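
To make the clipping step concrete, here is a minimal sketch on a tiny, hand-picked input. The three-element master and the three example rows are assumptions chosen for illustration, not data from the question.

```python
import numpy as np

master = np.array([12, 155, 179])          # must be sorted
triangles = np.array([[12, 155, 179],      # all three values are in master
                      [12, 155, 180],      # 180 is larger than max(master)
                      [5, 12, 155]])       # 5 is not in master

# Position at which each value would be inserted to keep master sorted.
idx = master.searchsorted(triangles)
# Values larger than master[-1] get index len(master), which is out of range,
# so clip them back to the last valid index before indexing into master.
idx = idx.clip(max=len(master) - 1)

# A value is a member only if master actually holds it at that position.
is_member = master[idx] == triangles
rows = np.where(is_member.all(axis=1))[0]
print(rows)  # [0]
```

Without the clip, the value 180 would produce index 3 and master[idx] would raise an IndexError.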

Answered By: Bi Rico

A more natural (and possibly faster) solution for set operations in numpy is to use the functions in numpy.lib.arraysetops. These generally let you avoid converting to and from Python’s set type. To check if one array is a subset of another, use numpy.setdiff1d() and test if the returned array has 0 length:

import numpy as np
a = np.arange(10)
b = np.array([1, 5, 9])
c = np.array([-5, 5, 9])
# is `a` a subset of `b`?
len(np.setdiff1d(a, b)) == 0 # gives False
# is `b` a subset of `a`?
len(np.setdiff1d(b, a)) == 0 # gives True
# is `c` a subset of `a`?
len(np.setdiff1d(c, a)) == 0 # gives False

You can also optionally set assume_unique=True for a potential speed boost.
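
As a rough sketch of how that option might be used: assume_unique=True is a promise that neither input contains repeated values, so an array with repeats should be deduplicated first. The array b with repeats below is an assumption made up for this example.

```python
import numpy as np

a = np.arange(10)
b = np.array([1, 5, 9, 5, 1])   # contains repeated values

# Deduplicate first so that the assume_unique promise actually holds.
b_unique = np.unique(b)
is_subset = len(np.setdiff1d(b_unique, a, assume_unique=True)) == 0
print(is_subset)  # True
```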

I’m actually a bit surprised that numpy doesn’t have something like a built-in issubset() function to do the above (analogous to set.issubset()).

Another option is to use numpy.in1d() (see https://stackoverflow.com/a/37262010/2020363)
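
A minimal sketch of that membership-based alternative, written with np.isin (the newer counterpart of np.in1d). The helper name issubset_isin is hypothetical and not part of numpy.

```python
import numpy as np

def issubset_isin(a, b):
    """Hypothetical helper: True if every element of `a` occurs in `b`."""
    return bool(np.isin(a, b).all())

a = np.arange(10)
b = np.array([1, 5, 9])
c = np.array([-5, 5, 9])

print(issubset_isin(b, a))  # True
print(issubset_isin(c, a))  # False
print(issubset_isin(a, b))  # False
```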

Edit: I just realized that at some point in the distant past this bothered me enough that I wrote my own simple function:

def issubset(a, b):
    """Return whether sequence `a` is a subset of sequence `b`"""
    return len(np.setdiff1d(a, b)) == 0
Answered By: Martin Spacek

Starting with:

master=[12,155,179,234,670,981,1054,1209,1526,1667,1853] #some indices of interest

triangles=np.random.randint(2000,size=(20000,3)) #some data

What’s the most pythonic way to find the indices of triplets contained in master? Try using np.in1d with a list comprehension:

inds = [j for j in range(len(triangles)) if all(np.in1d(triangles[j], master))]

%timeit says ~0.5 s = half a second

Is there a MUCH faster way (a factor of 1000!) that avoids Python’s slow looping? Try using np.isin with np.sum to build a boolean row mask and pass it to np.where:

inds = np.where(
 np.sum(np.isin(triangles, master), axis=-1) == triangles.shape[-1])

%timeit says ~0.0005 s = half a millisecond!

Advice: avoid looping over lists whenever possible, because for the same price as a single iteration of a Python loop containing one arithmetic operation, you can call a numpy function that performs thousands of those same operations.
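
One way to check both claims (same result, different cost) is sketched below. The seed, the two planted rows and the function names are assumptions made for illustration, and the absolute timings will differ from machine to machine.

```python
import timeit
import numpy as np

master = [12, 155, 179, 234, 670, 981, 1054, 1209, 1526, 1667, 1853]
rng = np.random.default_rng(0)                 # seed chosen arbitrarily
triangles = rng.integers(2000, size=(20000, 3))
# Plant two rows that are known to lie entirely inside master.
triangles[[3, 42]] = [[12, 155, 179], [1853, 670, 234]]

def with_loop():
    # One numpy call per row: the Python-level loop dominates the cost.
    return [j for j in range(len(triangles))
            if np.isin(triangles[j], master).all()]

def vectorized():
    # A single numpy call over the whole array.
    return np.where(np.isin(triangles, master).all(axis=-1))[0]

hits_loop = with_loop()
hits_vec = vectorized()
print(hits_loop == hits_vec.tolist())          # both approaches agree

# Time one run of each; expect the vectorized version to be far faster.
print(timeit.timeit(with_loop, number=1))
print(timeit.timeit(vectorized, number=1))
```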

Conclusion

It seems that np.isin(triangles, master) is the function you were looking for. It returns a boolean mask with the same shape as triangles that tells whether each element of triangles is also an element of master. Requiring that each row of the mask sums to 3 (i.e., the full length of a row in triangles) then gives a 1d mask for the desired rows, which np.where turns into indices.

Answered By: Mary O

One can also use np.isin, which might be more efficient than the list comprehension in @petrichor’s answer. Using the same setup:

import numpy as np

x = np.arange(30).reshape(10, 3)
searchKey = [4, 5, 8]
x[[0, 3, 7], :] = searchKey
array([[ 4,  5,  8],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 4,  5,  8],
       [12, 13, 14],
       [15, 16, 17],
       [18, 19, 20],
       [ 4,  5,  8],
       [24, 25, 26],
       [27, 28, 29]])

Now one can use np.isin; by default, it works element-wise:

np.isin(x, searchKey)
array([[ True,  True,  True],
       [False,  True,  True],
       [False, False,  True],
       [ True,  True,  True],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [ True,  True,  True],
       [False, False, False],
       [False, False, False]])

We now have to keep only the rows where all entries evaluate to True, for which we can use all:

np.isin(x, searchKey).all(1)
array([ True, False, False,  True, False, False, False,  True, False,
       False])

If one now wants the corresponding indices, one can use np.where:

np.where(np.isin(x, searchKey).all(1))
(array([0, 3, 7]),)

EDIT:

I just realized that one has to be careful, though. For example, if I do

x[4, :] = [8, 4, 5]

so that the assignment uses the same values as searchKey but in a different order, that row is still returned by

np.where(np.isin(x, searchKey).all(1))

which prints

(array([0, 3, 4, 7]),)

That can be undesired.
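
If an order-sensitive match is what you want, a possible sketch is to compare rows with == instead of np.isin. This is a swapped-in technique (plain broadcasting), and the variable names exact and any_order are made up for this example.

```python
import numpy as np

x = np.arange(30).reshape(10, 3)
searchKey = [4, 5, 8]
x[[0, 3, 7], :] = searchKey
x[4, :] = [8, 4, 5]  # same values, different order

# Broadcasting compares every row with searchKey position by position.
exact = np.where((x == searchKey).all(axis=1))[0]
# Membership only asks whether each value occurs anywhere in searchKey.
any_order = np.where(np.isin(x, searchKey).all(axis=1))[0]

print(exact)      # [0 3 7]
print(any_order)  # [0 3 4 7]
```

Note that the membership test would also accept a row such as [4, 4, 4], since it ignores both order and multiplicity.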

Answered By: Cleb