How do I efficiently find which elements of a list are in another list?

Question:

I want to know which elements of list_1 are in list_2. I need the output as an ordered list of booleans. But I want to avoid for loops, because both lists have over 2 million elements.

This is what I have and it works, but it’s too slow:

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

booleans = []
for i in list_1:
   booleans.append(i in list_2)

# booleans = [False, False, True, True, False, False]

I could split the list and use multithreading, but I would prefer a simpler solution if possible. I know some functions like sum() use vector operations. I am looking for something similar.

How can I make my code more efficient?

Asked By: herdek550

Answers:

You can use the map function.

Inside map I use a lambda function, which is a small anonymous function defined inline.

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

booleans = list(map(lambda e: e in list_2, list_1))
print(booleans)

output

[False, False, True, True, False, False]

However, if you want only the elements of list_1 that are also in list_2, you can use the filter function instead of map with the same code.

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

new_lst = list(filter(lambda e: e in list_2, list_1))  # filter instead of map
print(new_lst)

output

[1, 2]

Edited

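I am removing the in check against the list from the code, because in on a list also acts as a loop. I checked the timings using the timeit module.

You can use this code to get the list of True and False values.

In my timings this way is faster than the one above. Note, however, that set_2 - {e} builds a new set on every call, so each check still takes time proportional to the size of set_2.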

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
set_2 = set(list_2)

booleans = list(map(lambda e: set_2 != set_2 - {e}, list_1))
print(booleans)

output

[False, False, True, True, False, False]

This one returns the list of matching elements instead.

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
set_2 = set(list_2)

new_lst = list(filter(lambda e: set_2 != set_2 - {e}, list_1))  # filter instead of map
print(new_lst)

output

[1, 2]
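The timeit check mentioned above can be sketched roughly as follows. This is a minimal sketch, not the answerer's actual benchmark: the input sizes, the function names and the number of runs are assumptions chosen so the script finishes quickly. It also times a plain e in set_2 lookup for comparison.

```python
import timeit

# Small illustrative inputs (assumed sizes, not the 2-million-element lists from the question).
list_1 = [0, 0, 1, 2, 0, 0] * 200
list_2 = list(range(1, 501))
set_2 = set(list_2)


def with_list_lookup():
    # e in list_2 scans the list, so each check is O(len(list_2)).
    return list(map(lambda e: e in list_2, list_1))


def with_set_difference():
    # set_2 - {e} builds a new set on every call, so each check is still O(len(set_2)).
    return list(map(lambda e: set_2 != set_2 - {e}, list_1))


def with_set_lookup():
    # e in set_2 is a hash lookup, O(1) on average.
    return list(map(lambda e: e in set_2, list_1))


candidates = (with_list_lookup, with_set_difference, with_set_lookup)

# All three variants must agree before their timings are worth comparing.
results = [fn() for fn in candidates]
assert results[0] == results[1] == results[2]

for fn in candidates:
    seconds = timeit.timeit(fn, number=10)
    print(f"{fn.__name__}: {seconds:.4f} s for 10 runs")
```

The absolute numbers depend on the machine and the input, so only the relative ordering is meaningful.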

Because the OP doesn’t want to use a lambda function, here is a version with a named function:

list_1 = [0,0,1,2,0,0]*100000
list_2 = [1,2,3,4,5,6]*100000
set_2 = set(list_2)
def func(e):
    # True when removing e changes the set, i.e. when e is in set_2
    return set_2 != set_2 - {e}

booleans = list(map(func, list_1))

I know my way isn’t the best way to answer this, because I have never used NumPy much.

Answered By: codester_09

You can take advantage of the O(1) average-case complexity of the in operator on a set to make your for loop more efficient, so the final algorithm runs in O(n) time instead of O(n*n):

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

s = set(list_2)
booleans = []
for i in list_1:
   booleans.append(i in s)
print(booleans)

It is even faster as a list comprehension:

s = set(list_2)
booleans = [i in s for i in list_1]

If you only want to know which elements are common to both lists, you can use a set intersection. This is efficient because set operations are built in and already well optimized, but note that the result is unordered and contains each common element only once:

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

print(set(list_1).intersection(set(list_2)))

Output:

{1, 2}

To get the output as a list, convert the resulting set with the list() function:

print(list(set(list_1).intersection(set(list_2))))
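Note that the set intersection discards order and duplicates. If the matching elements are needed in their original order, one possible variant (a sketch, not one of the approaches timed in this thread) filters list_1 against a set built from list_2. The name common_in_order and the extra trailing 2 in list_1 are only for illustration.

```python
list_1 = [0, 0, 1, 2, 0, 0, 2]  # extra trailing 2 added to show a duplicate
list_2 = [1, 2, 3, 4, 5, 6]

s = set(list_2)  # build the set once; each lookup is O(1) on average

# Keep the elements of list_1 that also occur in list_2, preserving order and duplicates.
common_in_order = [i for i in list_1 if i in s]
print(common_in_order)  # [1, 2, 2]

# The set intersection reports each common value only once, in no guaranteed order.
print(sorted(set(list_1) & s))  # [1, 2]
```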
Answered By: Cardstdani

Use set() to get the unique items in each list, then take the intersection of the two sets with &:

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

set_1 = set(list_1)
set_2 = set(list_2)

common = set_1 & set_2  # elements present in both lists
if common:
    print(common)
else:
    print("No common elements")

Output:

{1, 2}

Answered By: SPYBUG96

If you want to use a vectorized approach, you can also use NumPy’s isin. It’s not the fastest method, as demonstrated by oda’s excellent post, but it’s definitely an alternative to consider.

import numpy as np

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

a1 = np.array(list_1)
a2 = np.array(list_2)

np.isin(a1, a2)
# array([False, False,  True,  True, False, False])

Answered By: crissal

I thought it would be useful to actually time some of the solutions presented here on a larger sample input. For this input and on my machine, I find Cardstdani’s approach to be the fastest, followed by the numpy isin() approach.

Setup 1

import random

list_1 = [random.randint(1, 10_000) for i in range(100_000)]
list_2 = [random.randint(1, 10_000) for i in range(100_000)]

Setup 2

list_1 = [random.randint(1, 10_000) for i in range(100_000)]
list_2 = [random.randint(10_001, 20_000) for i in range(100_000)]

Timings – ordered from fastest to slowest (setup 1).

Cardstdani – approach 1


I recommend converting Cardstdani’s approach into a list comprehension (in CPython, list comprehensions are typically faster because they avoid looking up and calling booleans.append on every iteration)

s = set(list_2)
booleans = [i in s for i in list_1]

# setup 1
6.01 ms ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2
4.19 ms ± 27.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

No list comprehension

s = set(list_2)
booleans = []
for i in list_1:
   booleans.append(i in s)

# setup 1
7.28 ms ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2
5.87 ms ± 8.19 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Cardstdani – approach 2 (with an assist from Timus)


common = set(list_1) & set(list_2)
booleans = [item in common for item in list_1]

# setup 1
8.3 ms ± 34.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2
6.01 ms ± 26.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Using the set intersection method

common = set(list_1).intersection(list_2)
booleans = [item in common for item in list_1]

# setup 1
10.1 ms ± 29.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2
4.82 ms ± 19.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

numpy approach (crissal)


a1 = np.array(list_1)
a2 = np.array(list_2)

a = np.isin(a1, a2)

# setup 1
18.6 ms ± 74.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2
18.2 ms ± 47.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2 (assuming list_1, list_2 already numpy arrays)
10.3 ms ± 73.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

List comprehension without a set

l = [i in list_2 for i in list_1]

# setup 1
4.85 s ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# setup 2
48.6 s ± 823 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Sharim – approach 1


booleans = list(map(lambda e: e in list_2, list_1))

# setup 1
4.88 s ± 24.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# setup 2
48 s ± 389 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Using the __contains__ method

booleans = list(map(list_2.__contains__, list_1))

# setup 1
4.87 s ± 5.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# setup 2
48.2 s ± 486 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
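The same bound-method trick also works with a set. The variant below was not timed in this post, so treat it as a sketch rather than a measured result; it simply swaps list_2.__contains__ for the __contains__ method of a set built from list_2. The name in_list_2 is only for illustration.

```python
list_1 = [0, 0, 1, 2, 0, 0]
list_2 = [1, 2, 3, 4, 5, 6]

# Bound method of a set: each call is an average O(1) hash lookup
# instead of a linear scan through list_2.
in_list_2 = set(list_2).__contains__

booleans = list(map(in_list_2, list_1))
print(booleans)  # [False, False, True, True, False, False]
```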

Sharim – approach 2


set_2 = set(list_2)
booleans = list(map(lambda e: set_2 != set_2 - {e}, list_1))

# setup 1
5.46 s ± 56.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# setup 2
11.1 s ± 75.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Varying the length of the input


Employing the following setup

import random 

list_1 = [random.randint(1, n) for i in range(n)]
list_2 = [random.randint(1, n) for i in range(n)]

and varying n in [2 ** k for k in range(18)]:

[Plot of the timings for this setup; image not available]
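Since the plot itself is not available here, the following is a rough sketch of how such a sweep could be reproduced. It is not the original benchmark code: the function names, the seed, the best-of-repeats timing and the smaller default range for k are assumptions, chosen so that the slow list-based lookup finishes quickly.

```python
import random
import timeit


def set_lookup(list_1, list_2):
    # Build the set once, then do average O(1) lookups.
    s = set(list_2)
    return [i in s for i in list_1]


def list_lookup(list_1, list_2):
    # Each membership test scans list_2, giving O(n * n) overall.
    return [i in list_2 for i in list_1]


def sweep(max_k=10, repeats=3, seed=0):
    """Time both lookups for n = 1, 2, 4, ..., 2 ** (max_k - 1)."""
    rng = random.Random(seed)  # fixed seed so the inputs are reproducible
    rows = []
    for n in [2 ** k for k in range(max_k)]:
        list_1 = [rng.randint(1, n) for _ in range(n)]
        list_2 = [rng.randint(1, n) for _ in range(n)]
        row = {"n": n}
        for fn in (set_lookup, list_lookup):
            # Best of a few single runs, in seconds.
            timings = timeit.repeat(lambda: fn(list_1, list_2), number=1, repeat=repeats)
            row[fn.__name__] = min(timings)
        rows.append(row)
    return rows


for row in sweep():
    print(row)
```

Raising max_k towards 18 matches the range used in the post, but the list-based lookup then becomes very slow.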

Employing the following setup

import random 

list_1 = [random.randint(1, n ** 2) for i in range(n)]
list_2 = [random.randint(1, n ** 2) for i in range(n)]

and varying n in [2 ** k for k in range(18)], we obtain similar results:

[Plot of the timings for this setup; image not available]

Employing the following setup

list_1 = list(range(n))
list_2 = list(range(n, 2 * n))

and varying n in [2 ** k for k in range(18)]:

[Plot of the timings for this setup; image not available]

Employing the following setup

import random 

list_1 = [random.randint(1, n) for i in range(10 * n)]
list_2 = [random.randint(1, n) for i in range(10 * n)]

and varying n in [2 ** k for k in range(18)]:

[Plot of the timings for this setup; image not available]

Answered By: oda

It’s probably simpler to just use the built-in set intersection method, but if you have lots of lists to compare, it might be faster to sort them. Sorting a list takes O(n log n) time, but once both lists are sorted you can compare them in linear time: check whether the current elements match and, if they don’t, advance in whichever list has the smaller current element.
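The linear-time comparison of two sorted lists described above can be sketched as a two-pointer walk. The function name common_sorted is hypothetical, and the sketch returns the common values once each rather than the boolean list the question asks for.

```python
def common_sorted(a, b):
    """Return the values present in both sorted lists, each reported once."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Match: record it unless it repeats the previous match.
            if not out or out[-1] != a[i]:
                out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a has the smaller current element, so advance in a
        else:
            j += 1  # b has the smaller current element, so advance in b
    return out


list_1 = [0, 0, 1, 2, 0, 0]
list_2 = [1, 2, 3, 4, 5, 6]

# Sorting costs O(n log n); the walk itself is linear in len(a) + len(b).
print(common_sorted(sorted(list_1), sorted(list_2)))  # [1, 2]
```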

Answered By: Acccumulation

If you know the values are non-negative integers and the maximum value is much smaller than the length of the list, then NumPy’s bincount might be a good alternative to using a set.

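# mask[v] is True when v occurs in list_2; minlength keeps every value of list_1 in range
np.bincount(list_2, minlength=max(list_1) + 1).astype(bool)[list_1]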

If list_1 and list_2 are already NumPy arrays, this can be much faster than the set + list-comprehension solution: in my test it took 263 µs versus 7.37 ms. If they are Python lists, it is slightly slower than the set solution, at 8.07 ms.
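The idea behind this approach (a lookup table indexed by value) can also be sketched in pure Python with a bytearray. This is a swapped-in illustration, not the NumPy code that was timed above; the function name membership_by_table is hypothetical, and it assumes all values are non-negative integers and both lists are non-empty.

```python
def membership_by_table(list_1, list_2):
    """For each element of list_1, report whether it occurs in list_2."""
    # Assumes non-negative integers; memory grows with the largest value.
    size = max(max(list_1), max(list_2)) + 1
    present = bytearray(size)  # present[v] == 1 when v occurs in list_2
    for v in list_2:
        present[v] = 1
    return [bool(present[v]) for v in list_1]


list_1 = [0, 0, 1, 2, 0, 0]
list_2 = [1, 2, 3, 4, 5, 6]
print(membership_by_table(list_1, list_2))  # [False, False, True, True, False, False]
```

As with bincount, this only pays off when the largest value is small relative to the list length, because the table has one slot per possible value.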

Answered By: towr

SPYBUG96’s method works well and is fast when you only need the common elements. If you want an indexable object containing the common elements of the two sets, you can use the tuple() function on the final set:

a = set(range(1, 6))
b = set(range(3, 9))
c = a & b
print(tuple(c))

Answered By: Kuba Sobolewski