Why is 0/1 faster than False/True for this sieve in PyPy?

Question:

Similar to "Why is using True slower than using 1 in Python 3?", but I'm using pypy3 and not using the sum function.

def sieve_num(n):
    nums = [0] * n
    for i in range(2, n):
        if i * i >= n: break
        if nums[i] == 0:
            for j in range(i*i, n, i):
                nums[j] = 1

    return [i for i in range(2, n) if nums[i] == 0]


def sieve_bool(n):
    nums = [False] * n
    for i in range(2, n):
        if i * i >= n: break
        if nums[i] == False:
            for j in range(i*i, n, i):
                nums[j] = True

    return [i for i in range(2, n) if nums[i] == False]

sieve_num(10**8) takes 2.55 s, but sieve_bool(10**8) takes 4.45 s, which is a noticeable difference.

My suspicion was that [0]*n is somehow smaller than [False]*n and fits into cache better, but sys.getsizeof and vmprof line profiling are unsupported on PyPy. The only info I could get is that the <listcomp> for sieve_num took 116 ms (19% of total execution time) while the <listcomp> for sieve_bool took 450 ms (40% of total execution time).

Using PyPy 7.3.1 implementing Python 3.6.9 on an Intel i7-7700HQ with 24 GB RAM on Ubuntu 20.04. With CPython 3.8.10, sieve_bool is only slightly slower.

Asked By: qwr


Answers:

The reason is that PyPy uses a specialized storage strategy for "list of ints that fit in 64 bits". It has a few other special cases, such as "list of floats" and "list of strings that contain only ASCII". The goal is primarily to save memory: a list of 64-bit integers is stored just like an array.array('l'), not as a list of pointers to actual integer objects. You save memory not in the size of the list itself, which doesn't change, but in not needing a very large number of small additional integer objects to exist all at once.
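For illustration, PyPy exposes the strategy a given list is using via the PyPy-only `__pypy__.strategy` helper (an assumption about PyPy's introspection API; the strategy names in the comments are what PyPy typically reports, and the sketch simply returns None on CPython, where the module does not exist):

```python
def list_strategy(lst):
    """Return PyPy's internal storage strategy name for lst, or None on CPython."""
    try:
        from __pypy__ import strategy  # PyPy-only introspection module
    except ImportError:
        return None  # CPython has no list strategies
    return strategy(lst)

# Under PyPy, [0] * 10 is typically reported as 'IntegerListStrategy'
# (unboxed 64-bit ints), while [False] * 10 falls back to the generic
# 'ObjectListStrategy' (a list of object pointers), since there is no
# specialized strategy for booleans:
print(list_strategy([0] * 10))
print(list_strategy([False] * 10))
```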

There is no special case for "list of booleans", because there are only ever two boolean objects in the first place, so there would be no memory-saving benefit from a strategy like "list of 64-bit ints" here. Of course, we could do better and store such a list with only one bit per entry, but that is not a really common pattern in Python; we just never got around to implementing it.
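To make the bit-per-entry idea concrete, here is a sketch of the same sieve packed into a bytearray at one bit per number. This is plain Python written for illustration, not anything PyPy does internally:

```python
def sieve_bits(n):
    """Sieve of Eratosthenes storing one flag bit per number."""
    bits = bytearray((n + 7) // 8)  # all bits start at 0 = "possibly prime"
    for i in range(2, n):
        if i * i >= n:
            break
        # Test bit i: byte index is i >> 3, bit position within it is i & 7.
        if not (bits[i >> 3] >> (i & 7)) & 1:
            for j in range(i * i, n, i):
                bits[j >> 3] |= 1 << (j & 7)  # mark j as composite
    return [i for i in range(2, n) if not (bits[i >> 3] >> (i & 7)) & 1]
```

This uses 8x less memory than a byte-per-entry buffer, at the cost of extra shift/mask work on every access.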

So why is it slower, anyway? The reason is that in the "list of general objects" case, the JIT compiler has to emit extra code to check the type of the object every time it reads an item from the list, and extra GC logic every time it writes an item into the list. This is not a lot of code, but in your case I would guess it doubles the length of the (extremely short) generated assembly for the inner loop doing nums[j] = 1.

Right now, on both PyPy and CPython(*), the fastest option is probably to use array.array('B') instead of a list. This avoids the PyPy-specific issue entirely and also uses substantially less memory, which is always a performance win when your data structure contains 10**8 elements.
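The array.array('B') variant can be sketched as follows; it is the same algorithm as sieve_num, just backed by a flat buffer of unsigned bytes instead of a list of objects:

```python
import array

def sieve_array(n):
    """Sieve of Eratosthenes over a flat byte buffer (one byte per flag)."""
    nums = array.array('B', bytes(n))  # n zero bytes, no boxed objects
    for i in range(2, n):
        if i * i >= n:
            break
        if nums[i] == 0:
            for j in range(i * i, n, i):
                nums[j] = 1  # mark multiples of i as composite
    return [i for i in range(2, n) if nums[i] == 0]
```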

EDIT: (*) no, it turns out that CPython is probably too slow for memory bandwidth to be the limit there. On my machine, PyPy is still maybe 30-35% faster when using bytes. See also the comments for a hack that speeds up CPython from 9x to 3x slower than PyPy, but which, as usual, slows down PyPy.

Answered By: Armin Rigo