How is `min` of two integers just as fast as 'bit hacking'?

Question:

I was watching a lecture series on ‘Bit Hacking’ and came across the following optimization for finding the minimum of two integers:

return x ^ ((y ^ x) & -(x > y))

Which is said to be faster than:

if x < y:
    return x
else:
    return y
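
For context, the trick relies on x > y evaluating to 0 or 1, so -(x > y) is either 0 or -1 (an all-ones bit mask). A quick check of both cases in Python:

x, y = 4, 5
# x > y is False here, so -(x > y) == 0 and the AND wipes out (y ^ x): we get x
print(x ^ ((y ^ x) & -(x > y)))   # 4

x, y = 5, 4
# now x > y is True, so -(x > y) == -1 (all bits set) and x ^ (y ^ x) == y
print(x ^ ((y ^ x) & -(x > y)))   # 4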

Since the min function can handle far more than just two integers (floats, strings, lists, and even custom objects), I assumed that calling min(x, y) would take longer than the optimized bit hack above. To my surprise, they were nearly identical:

>>> python -m timeit "min(4, 5)"
1000000 loops, best of 3: 0.203 usec per loop

>>> python -m timeit "4 ^ ((5 ^ 4) & -(4 > 5))"
10000000 loops, best of 3: 0.19 usec per loop

This is true even for numbers greater than 255 (i.e., outside the range of preallocated Python integer objects):

>>> python -m timeit "min(15456, 54657)"
10000000 loops, best of 3: 0.191 usec per loop

>>> python -m timeit "15456 ^ ((54657 ^ 15456) & -(54657 > 15456))"
10000000 loops, best of 3: 0.18 usec per loop

How is it that a function as versatile as min can still be so fast and optimized?

Note: I ran the above code using Python 3.5. I’m assuming this is the same for Python 2.7+, but I haven’t tested it.


I’ve created the following C module:

#include <Python.h>

static PyObject * my_min(PyObject *self, PyObject *args){
    long x;
    long y;

    if (!PyArg_ParseTuple(args, "ll", &x, &y))
        return NULL;

    return PyLong_FromLong(x ^ ((y ^ x) & -(x > y)));
}

static PyMethodDef MyMinMethods[] = {
    {"my_min", my_min, METH_VARARGS, "bit hack min"},
    {NULL, NULL, 0, NULL}
};

PyMODINIT_FUNC
initmymin(void)
{
    PyObject *m;

    m = Py_InitModule("mymin", MyMinMethods);
    if (m == NULL)
        return;

}
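
A minimal setup.py along these lines is enough to build such a module (the source file name mymin.c is an assumption):

from distutils.core import setup, Extension

# Builds the extension from the C source above; the file name mymin.c is assumed.
setup(name='mymin',
      version='1.0',
      ext_modules=[Extension('mymin', sources=['mymin.c'])])

Something like python setup.py build_ext --inplace builds it in place, and python setup.py install installs it.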

I compiled it and installed it on my system (an Ubuntu VM). I then ran the following:

>>> python -m timeit 'min(4, 5)'
10000000 loops, best of 3: 0.11 usec per loop

>>> python -m timeit -s 'import mymin' 'mymin.my_min(4,5)'
10000000 loops, best of 3: 0.129 usec per loop

While I understand that this is running in a VM, shouldn’t there still be a bigger gap in execution time, with the ‘bit hacking’ offloaded to native C?

Asked By: James Mertz


Answers:

This is likely due to how the min function is implemented in Python.

Many Python builtins are actually implemented in low-level languages such as C or assembly and use the Python APIs so that they are callable from Python.

Your bit-fiddling technique is likely very fast in C, but in Python the interpretation overhead of the statement will far exceed the overhead of calling even a complex function implemented in a low-level language.

If you really want a fair test, compare a C program (or a C Python extension) implementing that technique against your Python call to min and see how it compares; I expect that will explain the result you see.

EDIT:

Thanks to @Two-BitAlchemist I can now give some more details on additional reasons this bit twiddling will not work well in Python. It appears that integers are not stored in the obvious way but are actually a fairly complex expanding object designed to store potentially very large numbers.

Some details on this can be found here (thanks to Two-BitAlchemist), though it appears this has changed somewhat in newer Python versions. Still, the point remains that we are most certainly not manipulating a simple set of bits when we touch an integer in Python, but a complex object where the bit manipulations are in fact virtual method calls with enormous overhead (compared to what they do).
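
You can see part of this from Python itself: even a small int is a full heap object, and the object grows as the value needs more internal digits (the sizes below are from a 64-bit CPython 3 build and are illustrative only):

import sys

# Every int is a full object (PyLongObject): a header plus a variable-length
# array of internal digits, so the size grows with the magnitude of the value.
print(sys.getsizeof(1))         # e.g. 28 bytes on a 64-bit CPython 3
print(sys.getsizeof(2 ** 30))   # needs one more internal digit, e.g. 32 bytes
print(sys.getsizeof(2 ** 100))  # several digits, e.g. 40 bytes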

Answered By: Vality

Well, the bit hacking trick might have been faster in the 90s, but it is slower on current machines by a factor of two. Compare for yourself:

// gcc -Wall -Wextra -std=c11 ./min.c -D_POSIX_SOURCE -Os
// ./a.out 42

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define COUNT (1 << 28)

static int array[COUNT];

int main(int argc, char **argv) {
    (void) argc;
    unsigned seed = atoi(argv[1]);

    for (unsigned i = 0; i < COUNT; ++i) {
        array[i] = rand_r(&seed);
    }

    clock_t begin = clock();

    int x = array[0];
    for (unsigned i = 1; i < COUNT; ++i) {
        int y = array[i];
#if 1
        x = x ^ ((y ^ x) & -(x > y));
# else
        if (y < x) {
            x = y;
        }
#endif
    }

    clock_t end = clock();
    double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;

    printf("Minimum: %d (%.3f seconds)\n", x, time_spent);
    return 0;
}

On average it takes 0.277 seconds for the “naïve” implementation, but 0.442 seconds for the “optimized” implementation. Always keep a grain of doubt in CS classes: at least since the CMOVxx instruction (added with the Pentium Pro in 1995) there has been no chance that the bit-hacking solution could be faster.

On an i5-750 (gcc (Debian 5.2.1-23) 5.2.1 20151028):

     optimized   naïve
-O0  1.367       0.781
-O1  0.530       0.274
-O2  0.444       0.271
-O3  0.442       0.144
-Os  0.446       0.273

Afterthought: Compiler developers are very smart people, who spend their working days finding and implementing optimizations. If the bit hacking trick were faster, then your compiler would implement min() this way. And you can safely assume that the compiler understands what you are doing inside the loop. But the people working for Intel, AMD and so on are smart, too, so they will optimize important operations such as min() and max() if they see that the compiler hackers do weird hacks because the obvious solution is slow.

For the extra-curious: This is the generated code for the “optimized” implementation with -O3:

    mov    $0x40600b00, %ebp     # int *e = &array[COUNT];
    mov    0x600b00, %ebx        # int x = array[0];
    mov    $0x600b04, %edx       # int *i = &array[1];
loop:
    mov    (%rdx), %eax          # int y = *i;
    xor    %ecx, %ecx            # int tmp = (
    cmp    %ebx, %eax            #     y < x
    setl   %cl                   #   ? 1 : 0 );
    xor    %ebx, %eax            # y ^= x;
    add    $0x4, %rdx            # ++i;
    neg    %ecx                  # tmp = -tmp;
    and    %ecx, %eax            # y &= tmp;
    xor    %eax, %ebx            # x ^= y;
    cmp    %rdx, %rbp            # if (i != e) {
    jne    loop                  #    goto loop; }

And the naïve implementation with -Os (-O3 is huge and full of SSE instructions I would have to look up):

    mov    0x600ac0, %ebx        # int x = array[0];
    mov    $0x40600abc,%ecx      # int *e = &array[COUNT];
    mov    $0x600ac0,%eax        # int *i = &array[0];
loop:
    mov    0x4(%rax),%edx        # int y = *(i + 1);
    cmp    %edx,%ebx             # if (x > y) {
    cmovg  %edx,%ebx             #    x = y; }
    add    $0x4,%rax             # ++i;
    cmp    %rcx,%rax             # if (i != e) {
    jne    loop                  #    goto loop; }

Answered By: kay

I did something like this here a few days ago. It followed on after more obvious examples where jumps (poorly predicted) were killing performance.

Each operation [in Stein’s Algorithm] is very simple: test the least-significant bit, shift right one bit, increment an int. But the branch is a killer!

With a modern superscalar highly-pipelined processing core, a conditional branch breaks the pipeline. The x86 processors use branch prediction and speculative execution to mitigate this, but here the branch decision is essentially random on every iteration. It guesses wrong half the time.

But I still have one more trick left. if (n>m) std::swap (n, m); is a branch point, and it will go one way or the other many times as it loops. That is, this is another “bad” branch.

Replacing that conditional branch with non-branching bit manipulations (explained in the post; it’s a clearer example than the OP’s) did result in a measurable speedup. This differs from the result noted in another answer, so my “modern” form might work better. Also, it’s not just a min: both min and max are needed simultaneously, which requires extra assignments even in the regular implementation.
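
As a sketch of the idea (written in Python here only to show the expressions; the payoff is in compiled code, where the mispredicted branch is what hurts), both min and max can be picked out of the same mask:

def minmax_branchless(a, b):
    # mask is -1 (all bits set) when a < b, otherwise 0
    mask = -(a < b)
    smaller = b ^ ((a ^ b) & mask)   # a if a < b else b
    larger = a ^ ((a ^ b) & mask)    # b if a < b else a
    return smaller, larger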

The result indicates that all this math and register usage is cheaper than branching: 44 becomes 39 or 37, 84 becomes 75. This is about an 11% speedup in the overall algorithm.

Answered By: JDługosz

The way you’re measuring is flawed.

timeit is really complicated to use correctly. When you write this on the command line:

$ python -m timeit "min(4, 5)"
10000000 loops, best of 3: 0.143 usec per loop

python will gladly tell you that it took 0.143 usec per loop.

$ python -m timeit "any([0,3])"
10000000 loops, best of 3: 0.157 usec per loop

Hm, weird, very similar runtime.

IPython will shed some light:

In [3]: %timeit any([0,3])
The slowest run took 17.13 times longer than the fastest. This could mean that an intermediate result is being cached
10000000 loops, best of 3: 167 ns per loop

Ah stuff is being cached.

In [1]: %timeit min(4,5)
The slowest run took 18.31 times longer than the fastest. This could mean that an intermediate result is being cached
10000000 loops, best of 3: 156 ns per loop

In [4]: %timeit 4 ^ ((5 ^ 4) & -(4 > 5))
The slowest run took 19.02 times longer than the fastest. This could mean that an intermediate result is being cached
10000000 loops, best of 3: 100 ns per loop

I tried many things, but I can’t get rid of the caching warning, and I still don’t know how to measure this code with full confidence.
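
The closest I have is a sketch along these lines (not a definitive methodology): bind the operands to names in the setup, run the statement a fixed number of times, and take the best of several repeats with timeit.repeat:

import timeit

# Bind the operands to names in the setup so the statement operates on
# variables rather than on bare literals, then take the best of several repeats.
setup = "x, y = 15456, 54657"
for stmt in ("min(x, y)", "x ^ ((y ^ x) & -(x > y))"):
    runs = timeit.repeat(stmt, setup=setup, repeat=5, number=10**6)
    best = min(runs) / 10**6          # seconds per single execution of stmt
    print("%-30s %.3f usec per loop (best of 5)" % (stmt, best * 1e6))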

Answered By: Sebastian Wozny

Benchmarking is an art as much as a science. Technical details of the various languages and their internal implementations aside, in one example the statement to be measured is called once inside a function call, while in the other it runs inside a for loop that also contains an array reference.

The overhead of the function call, the array reference, and the loop vastly exceeds the time of the operation you are trying to measure. Imagine the dozens of instructions required for each. Inside that loop/function call you are trying to measure the speed of just a few instructions!

The C example is much better: it has far less overhead than the Python example, it is compiled, and the machine code can be analyzed carefully. You can speculate on the speed of that machine code, but to actually measure it you need a much more complex benchmark that maximizes the execution of the code you are trying to test and minimizes the other instructions. Various compiler optimizations can also distort your timings or even optimize away parts of what you think you are measuring!

With the C example, for each iteration, the loop overhead is 4 instructions, and what you are trying to measure is the speed of 1 or 2 instructions depending on the values. That’s very hard to do!

Not to mention that you are using elapsed time as the measurement, and even on an “idle” system there are plenty of random interrupts, page faults, and other activities to distort the timings. You have a huge array, which can be faulted in. One operation could be faster on a CISC machine than on a RISC machine, although here I’m assuming you are talking about x86-class machines.

I know this doesn’t answer the question; it’s more an analysis of the benchmarking methods used and how they affect getting a real, quantifiable answer.

Answered By: Bill

Let’s do a slightly deeper dive here to find out the real reason behind this weirdness (if any).

Let’s create three methods and look at their Python bytecode and runtimes…

import dis

def func1(x, y):
    return min(x, y)

def func2(x, y):
    if x < y:
        return x
    return y

def func3(x, y):
    return x ^ ((y ^ x) & -(x > y))

print "*" * 80
dis.dis(func1)
print "*" * 80
dis.dis(func2)
print "*" * 80
dis.dis(func3)

The output from this program is…

*****************************************************
  4           0 LOAD_GLOBAL              0 (min)
              3 LOAD_FAST                0 (x)
              6 LOAD_FAST                1 (y)
              9 CALL_FUNCTION            2
             12 RETURN_VALUE        
*****************************************************
  7           0 LOAD_FAST                0 (x)
              3 LOAD_FAST                1 (y)
              6 COMPARE_OP               0 (<)
              9 POP_JUMP_IF_FALSE       16

  8          12 LOAD_FAST                0 (x)
             15 RETURN_VALUE        

  9     >>   16 LOAD_FAST                1 (y)
             19 RETURN_VALUE        
*****************************************************
 12           0 LOAD_FAST                0 (x)
              3 LOAD_FAST                1 (y)
              6 LOAD_FAST                0 (x)
              9 BINARY_XOR          
             10 LOAD_FAST                0 (x)
             13 LOAD_FAST                1 (y)
             16 COMPARE_OP               4 (>)
             19 UNARY_NEGATIVE      
             20 BINARY_AND          
             21 BINARY_XOR          
             22 RETURN_VALUE        

Here are the running times of each of these functions

%timeit func1(4343,434234)
1000000 loops, best of 3: 282 ns per loop

%timeit func2(23432, 3243424)
10000000 loops, best of 3: 137 ns per loop

%timeit func3(928473, 943294)
1000000 loops, best of 3: 246 ns per loop

func2 is the fastest because it has the least work to do in the Python interpreter. How? Looking at the bytecode for func2, we see that whether x > y or x < y, the Python interpreter will execute 6 instructions.

func3 will execute 11 instructions (and is thus almost twice as slow as func2… in fact, it’s extremely close to 137.0 * 11 / 6 = 251 ns).

func1 has just 5 Python instructions, and by the logic of the previous two points we might think that func1 should be the fastest. However, there is a CALL_FUNCTION in there… and function calls have a lot of overhead in Python (because a new eval frame is created for the call – that’s what we see in a Python stack trace: a stack of eval frames).

More details: because Python is interpreted, each Python bytecode instruction takes much longer than a single C/asm statement. In fact, you can take a look at the Python interpreter source code to see that each instruction has an overhead of 30 or so C statements (this is from a very rough look at ceval.c, the main interpreter loop). The for (;;) loop executes one Python instruction per loop cycle (ignoring optimizations).

https://github.com/python/cpython/blob/master/Python/ceval.c#L1221

So, with that much overhead per instruction, there is little point in comparing two tiny C-level operations from Python. One will take 34 and the other 32 CPU cycles, because the Python interpreter adds roughly 30 cycles of overhead to each instruction.

In the OP’s C module, if we loop inside the C function to do the comparison a million times, that loop will not have the Python interpreter’s per-instruction overhead. It will probably run 30 to 40 times faster.

Tips for Python optimization:

  • Profile your code to find hotspots and refactor the hot code into its own function (write tests for the hotspot first, so the refactor doesn’t break anything).

  • Avoid function calls from the hot code (inline functions if possible).

  • Use the dis module on the new function to find ways to reduce the number of Python instructions (if x is faster than if x is True… surprised? See the sketch below).

  • Modify your algorithm.

  • Finally, if the Python speedup is not enough, reimplement your new function in C.
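
As an example of the dis tip above, these two seemingly equivalent checks compile to different amounts of bytecode (the exact opcodes vary by Python version; this is just an illustration):

import dis

def check_truthy(x):
    if x:             # compiles to a single jump on the truth value
        return 1

def check_is_true(x):
    if x is True:     # an extra load of True plus a COMPARE_OP before the jump
        return 1

dis.dis(check_truthy)
dis.dis(check_is_true)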

PS: The explanation above is simplified to keep the answer a reasonable size. For example, not all Python instructions take the same amount of time, and there are optimizations, so not every instruction has the same overhead… and a lot more. Please forgive such omissions for the sake of brevity.

Answered By: Jug

Here are some timings on Python 2.7 (’cause I assumed wrong, I’m sorry):

def mymin(x, y):
    if x < y:
        return x
    return y
10000000 loops, best of 3: 0.0897 usec per loop

def mymin(x, y):
    return y
10000000 loops, best of 3: 0.0738 usec per loop

mymin = min
10000000 loops, best of 3: 0.11 usec per loop

mymin = operator.add
10000000 loops, best of 3: 0.0657 usec per loop

What does this mean? It means almost all of the cost is in calling the function. The fastest CPython can physically go here is 0.066 usec per loop, which add achieves.
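
For reference, numbers like these come from invocations along the following lines (the exact command lines here are my assumption, not a record of what was run; the def is a one-line equivalent of the one above):

>>> python -m timeit -s "def mymin(x, y): return x if x < y else y" "mymin(4, 5)"

>>> python -m timeit -s "mymin = min" "mymin(4, 5)"

>>> python -m timeit -s "import operator; mymin = operator.add" "mymin(4, 5)"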

Your min function in C is going to have

  • less overhead because it doesn’t deal with arbitrary arguments and cmp, but

  • more overhead because it generates a new integer, rather than just returning the old one. PyArg_ParseTuple probably isn’t fast, either.

The actual C instructions for comparison or bit shifting cost effectively nothing. They’re free. Amdahl’s law is laughing at you.


Meanwhile, PyPy takes roughly 0.0003 usec per call to min, or 200x less time. Evidently the C instructions are at least that cheap, since they compile to good machine code.


Maybe I’ll put it another way…

What’s more expensive than a branch or comparison?

  • Allocating, which Python does to allocate the function’s frame and to allocate the tuple to store the arguments in.

  • String parsing, which PyArg_ParseTuple does.

  • varargs, also used by PyArg_ParseTuple.

  • Table lookups, which PyLong_FromLong performs.

  • Computed gotos, performed by CPython’s internal bytecode dispatch (and I believe 2.7 uses a switch statement, which is even slower).

The body of min, implemented in C, is not the problem.

Answered By: Veedrac