list comprehension filtering – "the set() trap"

Question:

A reasonably common operation is to filter one list based on another list. People quickly find that this:

[x for x in list_1 if x in list_2]

is slow for large inputs – it’s O(n*m). Yuck. How do we speed this up? Use a set to make filtering lookups O(1):

s = set(list_2)
[x for x in list_1 if x in s]

This gives nice overall O(n+m) behavior. However, I often see even veteran coders fall into The Trap™:

[x for x in list_1 if x in set(list_2)]

Ack! This is again O(n*m) since python builds set(list_2) on every iteration, not just once.


I thought that was the end of the story – python can’t optimize it away to only build the set once. Just be aware of the pitfall. Gotta live with it. Hmm.

#python 3.3.2+
list_2 = list(range(20)) #small for demonstration purposes
s = set(list_2)
list_1 = list(range(100000))
def f():
    return [x for x in list_1 if x in s]
def g():
    return [x for x in list_1 if x in set(list_2)]
def h():
    return [x for x in list_1 if x in {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19}]

%timeit f()
100 loops, best of 3: 7.31 ms per loop

%timeit g()
10 loops, best of 3: 77.4 ms per loop

%timeit h()
100 loops, best of 3: 6.66 ms per loop

Huh, python (3.3) can optimize away a set literal. It’s even faster than f() in this case, presumably because it gets to replace a LOAD_GLOBAL with a LOAD_CONST.

#python 2.7.5+
%timeit h()
10 loops, best of 3: 72.5 ms per loop

Python 2 notably doesn’t do this optimization. I’ve tried investigating further what python3 is doing, but unfortunately dis.dis by itself cannot probe the innards of comprehension expressions – basically everything interesting turns into MAKE_FUNCTION.
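
(You can, however, dig the comprehension’s nested code object out of the function’s constants and disassemble it directly – a quick sketch, assuming a CPython version that still compiles the comprehension to a separate code object; the bytecode details vary by release:)

import dis

# The comprehension body is a nested code object stored among g's constants;
# disassembling it (rather than g itself) shows set(list_2) being called
# inside the loop, while the same probe on h shows a single pre-built
# frozenset constant instead.
comp = next(c for c in g.__code__.co_consts if hasattr(c, "co_code"))
dis.dis(comp)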

So now I’m wondering – why can python 3.x optimize away the set literal to only build once, but not set(list_2)?

Asked By: roippi


Answers:

In order to optimize set(list_2), the interpreter needs to prove that list_2 (and all of its elements) does not change between iterations. This is a hard problem in the general case, and it would not surprise me if the interpreter does not even attempt to tackle it.

On the other hand, a set literal made up of constants cannot change its value between iterations, so the optimization is known to be safe.

Answered By: NPE

The basic reason is that a literal really can’t change, whereas if it’s an expression like set(list_2), it’s possible that evaluating the target expression or the iterable of the comprehension could change the value of set(list_2). For instance, if you have

[f(x) for x in list_1 if x in set(list_2)]

It is possible that f modifies list_2.
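
A small sketch of how that could change the result (the particular f and list_2 here are made up purely for illustration):

list_2 = [0, 1, 2]

def f(x):
    # Side effect: each accepted element grows list_2, so a *fresh*
    # set(list_2) on later iterations admits values the original would not.
    list_2.append(x + 3)
    return x

print([f(x) for x in range(5) if x in set(list_2)])
# -> [0, 1, 2, 3, 4]; with set(list_2) hoisted and built only once,
#    the result would be [0, 1, 2]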

Even for a simple [x for x in blah ...] expression, it’s theoretically possible that the __iter__ method of blah could modify list_2.

I would imagine there is some scope for optimizations, but the current behavior keeps things simpler. If you start adding optimizations for things like “it is only evaluated once if the target expression is a single bare name and the iterable is a builtin list or dict…”, you make it much more complicated to figure out what will happen in any given situation.

Answered By: BrenBarn

From What’s New In Python 3.2:

Python’s peephole optimizer now recognizes patterns such x in {1, 2, 3} as being a test for membership in a set of constants. The optimizer recasts the set as a frozenset and stores the pre-built constant.
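
This is easy to see by disassembling a plain membership test (a quick sketch; the exact opcode names differ between CPython versions):

import dis

def member(x):
    return x in {1, 2, 3}

# The literal is loaded as a single pre-built constant, e.g.
# LOAD_CONST frozenset({1, 2, 3}), instead of being rebuilt on every call.
dis.dis(member)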

Answered By: Ashwini Chaudhary

So now I’m wondering – why can python 3.x optimize away the set literal to only build once, but not set(list_2)?

No one’s mentioned this issue yet: how do you know set([1,2,3]) and {1, 2, 3} are the same thing?

>>> import random
>>> def set(arg):
...     return [random.choice(range(5))]
... 
>>> list1 = list(range(5))
>>> [x for x in list1 if x in set(list1)]
[0, 4]
>>> [x for x in list1 if x in set(list1)]
[0]

You can’t shadow a literal; you can shadow set. So before you can consider hoisting, you need to know not just that list1 isn’t being affected, but also that set is what you think it is. Sometimes you can do that, either under restrictive conditions at compile time or more conveniently at runtime, but it’s definitely nontrivial.

It’s kind of funny: often when optimizations like this are suggested, one pushback is that, as nice as they are, they make it harder to reason about what Python performance is going to be like, even algorithmically. Your question provides some evidence for this objection.

Answered By: DSM

Too long for a comment

This won’t speak to the optimization details or to v2 vs. v3 differences, but in some situations where I run into this, I find it useful to make a context manager out of the data object:

class context_set(set):
    def __enter__(self):
        return self
    def __exit__(self, *args):
        pass

def context_version():
    with context_set(list_2) as s:
        return [x for x in list_1 if x in s]

Using this I see:

In [180]: %timeit context_version()
100 loops, best of 3: 17.8 ms per loop

In some cases it provides a nice stop-gap between creating the object before the comprehension and creating it within the comprehension, and it allows custom tear-down code if you want it.

A more generic version can be made using contextlib.contextmanager. Here’s a quick-and-dirty version of what I mean.

def context(some_type):
    from contextlib import contextmanager
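    # One-shot generator: yields some_type(x) exactly once, which is the
    # shape contextmanager expects from a generator function.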
    generator_apply_type = lambda x: (some_type(y) for y in (x,))
    return contextmanager(generator_apply_type)

Then one can do:

with context(set)(list_2) as s:
    # ...

or just as easily

with context(tuple)(list_2) as t:
    # ...

Answered By: ely