Efficiently knowing if the intersection of two lists is empty or not, in Python
Question:
Suppose I have two lists, L and M. Now I want to know if they share an element.
Which would be the fastest way of asking (in python) if they share an element?
I don’t care which elements they share, or how many, just if they share or not.
For example, in this case
L = [1,2,3,4,5,6]
M = [8,9,10]
I should get False, and here:
L = [1,2,3,4,5,6]
M = [5,6,7]
I should get True.
I hope the question’s clear.
Thanks!
Manuel
Answers:
First of all, if you do not need them ordered, then switch to the set type.
If you still need the list type, then do it this way (a result of 0 means the intersection is empty, and 0 == False):
len(set.intersection(set(L), set(M)))
Or more concisely
if set(L) & set(M):
    pass  # there is an intersection
else:
    pass  # no intersection
If you really need True or False:
bool(set(L) & set(M))
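As a quick check, the set-based test can be run against the two examples from the question. This is only a minimal sketch; the helper name `share_element` is made up for illustration and is not part of the answer above.

```python
def share_element(L, M):
    # Hypothetical helper: True if the two lists have at least one common item.
    # Both sets and the full intersection are built, as in the answer above.
    return bool(set(L) & set(M))

print(share_element([1, 2, 3, 4, 5, 6], [8, 9, 10]))  # False
print(share_element([1, 2, 3, 4, 5, 6], [5, 6, 7]))   # True
```

Note that this always constructs the whole intersection, even though only its emptiness matters.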
After running some timings, this seems to be a good option to try too
m_set=set(M)
any(x in m_set for x in L)
If the items in M or L are not hashable, you have to use a less efficient approach like this, where every `x in M` is a linear scan of the list:
any(x in M for x in L)
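To illustrate the unhashable case, here is a small sketch with lists of lists. The sample data is made up for the example; the point is only that `set()` refuses such items while the list-based test still works.

```python
L = [[1, 2], [3, 4]]
M = [[5, 6], [3, 4]]

# Inner lists are mutable and therefore unhashable, so no set can be built.
try:
    set(M)
    set_worked = True
except TypeError:
    set_worked = False
print(set_worked)  # False

# List membership compares with ==, at a cost of roughly len(L) * len(M) comparisons.
print(any(x in M for x in L))  # True
```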
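Here are some timings for 100-item lists, run under Python 2 (hence range returning a list, and xrange further down). Using sets is considerably faster when there is no intersection (about 32 µs against 374 µs), and slower when there is a considerable intersection that is found early (18 µs against 4.88 µs for the plain list scan).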
M=range(100)
L=range(100,200)
timeit set(L) & set(M)
10000 loops, best of 3: 32.3 µs per loop
timeit any(x in M for x in L)
1000 loops, best of 3: 374 µs per loop
timeit m_set=frozenset(M);any(x in m_set for x in L)
10000 loops, best of 3: 31 µs per loop
L=range(50,150)
timeit set(L) & set(M)
10000 loops, best of 3: 18 µs per loop
timeit any(x in M for x in L)
100000 loops, best of 3: 4.88 µs per loop
timeit m_set=frozenset(M);any(x in m_set for x in L)
100000 loops, best of 3: 9.39 µs per loop
# Now for some random lists
import random
L=[random.randrange(200000) for x in xrange(1000)]
M=[random.randrange(200000) for x in xrange(1000)]
timeit set(L) & set(M)
1000 loops, best of 3: 420 µs per loop
timeit any(x in M for x in L)
10 loops, best of 3: 21.2 ms per loop
timeit m_set=set(M);any(x in m_set for x in L)
1000 loops, best of 3: 168 µs per loop
timeit m_set=frozenset(M);any(x in m_set for x in L)
1000 loops, best of 3: 371 µs per loop
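For anyone who wants to rerun the comparison on a current interpreter, here is a rough sketch using the standard `timeit` module on Python 3. The function names, the fixed seed and the repeat count are my own choices, not part of the timings above, and the absolute numbers will differ from machine to machine.

```python
import random
import timeit

random.seed(0)  # arbitrary fixed seed, only so that reruns use the same lists
L = [random.randrange(200000) for _ in range(1000)]
M = [random.randrange(200000) for _ in range(1000)]

def with_intersection():
    return bool(set(L) & set(M))

def with_any_list():
    return any(x in M for x in L)

def with_any_set():
    m_set = set(M)
    return any(x in m_set for x in L)

REPEATS = 20  # kept small because the list-based variant is slow
for func in (with_intersection, with_any_list, with_any_set):
    seconds = timeit.timeit(func, number=REPEATS)
    print("%-20s %.1f us per call" % (func.__name__, seconds / REPEATS * 1e6))
```

All three functions must return the same answer for the same lists; only the time taken should differ.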
To avoid the work of constructing the intersection, and produce an answer as soon as we know that they intersect:
def have_common_item(L, M):  # function name is illustrative
    m_set = frozenset(M)
    return any(x in m_set for x in L)
Update: gnibbler tried this out and found it to run faster with set() in place of frozenset(). Whaddayaknow.
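To make the early exit visible, here is a small sketch that counts how many items of L are actually examined. The names `any_common` and `counting` are hypothetical helpers written for this demonstration. The last line uses `set.isdisjoint`, a standard set method that none of the answers here mention; I include it only as a possible alternative that also accepts any iterable.

```python
def any_common(L, M):
    # Same idea as the snippet above, using set() as suggested in the update.
    m_set = set(M)
    return any(x in m_set for x in L)

def counting(iterable, counter):
    # Yield items unchanged while recording how many were consumed.
    for item in iterable:
        counter[0] += 1
        yield item

L = list(range(50, 150))
M = list(range(100))

seen = [0]
result = any_common(counting(L, seen), M)
print(result, seen[0])  # True 1

# Alternative from the standard library, not taken from the answers above.
print(not set(M).isdisjoint(L))  # True
```

The first element of L (50) is already in M, so `any` stops after a single lookup instead of walking all 100 items.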
This is the most generic approach with balanced efficiency that I could come up with (the comments should make the code easy to understand):
import itertools, operator

def _compare_product(list1, list2):
    "Return if any item in list1 equals any item in list2 exhaustively"
    return any(
        itertools.starmap(
            operator.eq,
            itertools.product(list1, list2)))

def do_they_intersect(list1, list2):
    "Return if any item is common between list1 and list2"
    # do not try to optimize for small list sizes
    if len(list1) * len(list2) <= 100:  # pick a small number
        return _compare_product(list1, list2)

    # first try to make a set from one of the lists
    try:
        a_set = set(list1)
    except TypeError:
        try:
            a_set = set(list2)
        except TypeError:
            a_set = None
        else:
            a_list = list1
    else:
        a_list = list2
    # here either a_set is None, or we have a_set and a_list

    if a_set is not None:
        try:
            # generator expression instead of itertools.imap (Python 2 only)
            return any(x in a_set for x in a_list)
        except TypeError:
            pass  # a_list holds unhashable items; fall through to sorting

    # try to sort the lists
    try:
        a_list1 = sorted(list1)
        a_list2 = sorted(list2)
    except TypeError:  # sorry, not sortable
        return _compare_product(list1, list2)

    # they could be sorted, so let's take the N+M road,
    # not the N*M
    iter1 = iter(a_list1)
    iter2 = iter(a_list2)
    try:
        item1 = next(iter1)
        item2 = next(iter2)
    except StopIteration:  # one of the lists is empty
        return False  # ie no common items

    while True:
        if item1 == item2:
            return True
        while item1 < item2:
            try: item1 = next(iter1)
            except StopIteration: return False
        while item2 < item1:
            try: item2 = next(iter2)
            except StopIteration: return False
HTH.
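For reference, the "N+M road" from the code comments above can be tried on its own. This is a simplified index-based sketch, not the code from the answer; the function name `sorted_lists_intersect` and the sample data are made up. Keep in mind that the sorting step itself still costs O(N log N + M log M) before the linear merge.

```python
def sorted_lists_intersect(list1, list2):
    # Hypothetical standalone version of the sorted-merge step.
    # Items must be mutually comparable with < and ==, but need not be hashable.
    a = sorted(list1)
    b = sorted(list2)
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            return True
        if a[i] < b[j]:
            i += 1  # a[i] is too small to match anything left in b
        else:
            j += 1  # b[j] is too small to match anything left in a
    return False

print(sorted_lists_intersect([[3, 4], [1, 2]], [[9, 9], [3, 4]]))  # True
print(sorted_lists_intersect([[3, 4], [1, 2]], [[9, 9], [0, 0]]))  # False
```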