Is there a reason Python 3 enumerates slower than Python 2?

Question:

Python 3 appears to be slower than Python 2 at enumerating a minimal loop by a significant margin, and the gap appears to be getting worse with newer versions of Python 3.

I have Python 2.7.6, Python 3.3.3, and Python 3.4.0 installed on my 64-bit Windows machine (Intel i7-2700K – 3.5 GHz), with both 32-bit and 64-bit versions of each Python installed. While there is no significant difference in execution speed between 32-bit and 64-bit for a given version (within its limitations as to memory access), there is a very significant difference between version levels. I’ll let the timing results speak for themselves:

C:\**Python34_64**\python -mtimeit -n 5 -r 2 -s"cnt = 0" "for i in range(10000000): cnt += 1"
5 loops, best of 2: **900 msec** per loop

C:\**Python33_64**\python -mtimeit -n 5 -r 2 -s"cnt = 0" "for i in range(10000000): cnt += 1"
5 loops, best of 2: **820 msec** per loop

C:\**Python27_64**\python -mtimeit -n 5 -r 2 -s"cnt = 0" "for i in range(10000000): cnt += 1"
5 loops, best of 2: **480 msec** per loop

Since the Python 3 “range” is not the same as Python 2’s “range”, and is functionally the same as Python 2’s “xrange”, I also timed that as follows:

C:\**Python27_64**\python -mtimeit -n 5 -r 2 -s"cnt = 0" "for i in **xrange**(10000000): cnt += 1"
5 loops, best of 2: **320 msec** per loop
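For anyone who wants to repeat the measurement from a script instead of the command line, here is a minimal sketch using the standard `timeit` module. The helper name `best_time` and the smaller loop count are my own choices for illustration, not part of the original runs, so the absolute numbers will not match the ones above.

```python
import sys
import timeit

# Same statement as in the command-line runs, with a smaller count so it finishes quickly.
LOOP = "for i in range(1000000): cnt += 1"


def best_time(stmt, setup="cnt = 0", number=1, repeat=3):
    """Return the best time in seconds over `repeat` runs of `number` executions of stmt."""
    return min(timeit.repeat(stmt, setup=setup, number=number, repeat=repeat))


if __name__ == "__main__":
    print(sys.version)
    print("%.1f msec per loop" % (best_time(LOOP) * 1000.0))
```

Running the same file under each installed interpreter should give directly comparable figures, since only the interpreter changes.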

One can easily see that version 3.3 is almost twice as slow as version 2.7 and Python 3.4 is about 10% slower than that again.

My question: Is there an environment option or setting that corrects this, or is it just inefficient code or the interpreter doing more for the Python 3 version?


The answer seems to be that Python 3 uses the “infinite precision” integers that used to be called “long” in Python 2.x as its default “int” type, without any option to use the Python 2 fixed bit-length “int”, and it is the processing of these variable-length “int”s that takes the extra time, as discussed in the answers and comments below.

It may be that Python 3.4 is somewhat slower than Python 3.3 because of changes to memory allocation to support synchronization that slightly slow memory allocation/deallocation, which is likely the main reason that the current version of “long” processing runs slower.

Asked By: GordonBGood

Answers:

The difference is due to the replacement of the int type with the long type. Obviously operations with long integers are going to be slower because the long operations are more complex.

If you force python2 to use longs by setting cnt to 0L the difference goes away:

$python2 -mtimeit -n5 -r2 -s"cnt=0L" "for i in range(10000000): cnt += 1L"
5 loops, best of 2: 1.1 sec per loop
$python3 -mtimeit -n5 -r2 -s"cnt=0" "for i in range(10000000): cnt += 1"
5 loops, best of 2: 686 msec per loop
$python2 -mtimeit -n5 -r2 -s"cnt=0L" "for i in xrange(10000000): cnt += 1L"
5 loops, best of 2: 714 msec per loop

As you can see, on my machine Python 3.4 is faster than Python 2 with either range or xrange when using longs. The last benchmark, with Python 2’s xrange, shows that the difference in this case is minimal.

I don’t have python3.3 installed, so I cannot make a comparison between 3.3 and 3.4, but as far as I know nothing significant changed between these two versions (regarding range), so the timings should be about the same. If you see a significant difference try to inspect the generated bytecode using the dis module. There was a change about memory allocators (PEP 445) but I have no idea whether the default memory allocators were modified and which consequences there were performance-wise.
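As a rough illustration of the bytecode inspection suggested above, the following sketch compiles the loop and lists its opcodes with `dis`. The helper `opnames` is a name I made up for this example, and the exact opcode names differ between Python versions, so compare the output of two interpreters rather than relying on any particular name.

```python
import dis

# The benchmark loop as module-level source (small count; we only compile it).
SOURCE = "cnt = 0\nfor i in range(10): cnt += 1\n"


def opnames(source):
    """Compile source and return the list of opcode names it produces."""
    code = compile(source, "<loop>", "exec")
    return [ins.opname for ins in dis.get_instructions(code)]


if __name__ == "__main__":
    # Full human-readable disassembly, for comparing interpreters side by side.
    dis.dis(compile(SOURCE, "<loop>", "exec"))
```

If the opcode sequences are essentially the same in both versions, the difference is more likely in how the interpreter executes them than in the compiler.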

Answered By: Bakuriu

A summary answer of what I’ve learned from this question might be of help to others who wonder the same things as I did:

  1. The reason for the slowdown is that all integer variables in Python 3.x are now “infinite precision”, like the type that used to be called “long” in Python 2.x, which is now the only integer type as decided by PEP 237. As per that document, “short” integers that had the bit-depth of the base architecture no longer exist (or only internally).

  2. The old “short” variable operations could run reasonably fast because they could use the underlying machine code operations directly and optimized the allocation of new “int” objects because they always had the same size.

  3. The “long” type is currently only represented by a class object allocated in memory as it could exceed a given fixed length register/memory location’s bit-depth; since these object representations could grow or shrink for various operations and thus have a variable size, they cannot be given a fixed memory allocation and left there.

  4. These “long” types (currently) don’t use a full machine architecture word size but reserve a bit (normally the sign bit) to do overflow checks, thus the “infinite precision long” is divided (currently) into 15-bit/30-bit slice “digits” for 32-bit/64-bit architectures, respectively.

  5. Many of the common uses of these “long” integers won’t require more than one (or maybe two for 32-bit architectures) “digits” as the range of one “digit” is about one billion/32768 for 64-bit/32-bit architectures, respectively.

  6. The ‘C’ code is reasonably efficient for doing one or two “digit” operations, so the performance cost over the simpler “short” integers isn’t all that high for many common uses as far as actual computation goes as compared to the time required to run the byte-code interpreter loop.

  7. The biggest performance hit is likely the constant memory allocation/deallocation, one pair for each integer operation in the loop, which is quite expensive, especially as Python moves to support multi-threading with synchronization locks (which is likely why Python 3.4 is worse than 3.3).

  8. Currently, the implementation always ensures sufficient “digits” by allocating one extra “digit” above the actual size of “digits” used for the biggest operand if there is a possibility that it might “grow”, doing the operation (which may or may not actually use that extra “digit”), and then normalizes the result length to account for the actual number of “digits” used, which may actually stay the same (or possibly “shrink” for some operations); this is done by just reducing the size count in the “long” structure without a new allocation so may waste one “digit” of memory space but saving the performance cost of yet another allocation/deallocation cycle.

  9. There is hope for a performance improvement: For many operations it is possible to predict whether the operation will cause a “grow” or not – for instance, for an addition one just needs to look at the Most Significant Bits (MSBs) and the operation cannot grow if both MSBs are zero, which will be the case for many loop/counter operations; a subtraction won’t “grow” depending on the signs and MSBs of the two operands; a left shift will only “grow” if the MSB is a one; et cetera.

  10. For those cases where the statement is something like “cnt += 1”/“i += step” and so on (opening the possibility of operations in place for many use cases), an “in place” version of the operations could be called which would do the appropriate quick checks and only allocate a new object if a “grow” was necessary, otherwise doing the operation in place of the first operand. The complication would be that the compiler would need to produce these “in-place” byte codes; however, that has already been done, with appropriate special “in-place operation” byte codes produced, just that the current byte-code interpreter directs them to the usual version as described above because they have not yet been implemented (zero’d/null values in the table).

  11. It may well be that all that has to be done is write versions of these “in-place operations” and fill them into the “long” methods table with the byte-code interpreter already finding and running them if they exist or minor changes to a table to cause it to call them being all that is required.
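The “grow” prediction for addition in point 9 can be sketched in pure Python. This is only a model of the idea, not the C implementation: it assumes 30-bit “digits” (the 64-bit case from point 4), handles non-negative values only, and the names `SHIFT`, `ndigits` and `add_cannot_grow` are hypothetical.

```python
SHIFT = 30  # assumed bits per "digit" (the 64-bit build case described above)


def ndigits(x):
    """Number of SHIFT-bit digits needed for non-negative x (at least one)."""
    return max(1, -(-x.bit_length() // SHIFT))


def add_cannot_grow(a, b):
    """True if a + b is guaranteed to fit in the digits of the larger operand.

    Conservative check from point 9: the top bit of the widest operand's
    most significant digit must be zero in both operands.
    """
    n = max(ndigits(a), ndigits(b))
    top_bit = 1 << (n * SHIFT - 1)
    return not (a & top_bit) and not (b & top_bit)


if __name__ == "__main__":
    print(add_cannot_grow(5, 7))          # small counters: no grow possible
    print(add_cannot_grow(2**30 - 1, 1))  # top bit set: result may need a new digit
```

The check is deliberately pessimistic: it may report that a “grow” is possible when the sum still fits, but it never claims safety when an extra digit is needed, which is what an in-place fast path would require.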

Note that floats are always the same size, so they could have the same improvements made, although floats are allocated in blocks of spare locations for better efficiency; it would be much harder to do that for “long”s as they take a variable amount of memory.
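The fixed size of floats versus the variable size of integers can be observed from Python itself. A small sketch, assuming CPython, where `sys.int_info` reports the digit layout and `sys.getsizeof` reports the object size; the helper name `int_sizes` is mine.

```python
import sys


def int_sizes(max_digits=4):
    """Object sizes in bytes of ints that need 1, 2, ... max_digits digits."""
    bits = sys.int_info.bits_per_digit
    return [sys.getsizeof(1 << (bits * k)) for k in range(max_digits)]


if __name__ == "__main__":
    print(sys.int_info)  # bits_per_digit and sizeof_digit for this build
    print("int sizes:  ", int_sizes())
    print("float sizes:", sys.getsizeof(1.0), sys.getsizeof(1e300))
```

On a typical build the integer sizes step up as digits are added, while both floats report the same size, which is why a free list of equally sized slots is straightforward for floats and harder for integers.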

Also note that this would break the immutability of “long”s (and optionally floats), which is why there are no in-place operators defined, but the fact that they are treated as mutable only for these special cases doesn’t affect the outside world as it would never realize that sometimes a given object has the same address as the old value (as long as equality comparisons look at contents and not just object addresses).
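To make the immutability point concrete, here is a small sketch of what `cnt += 1` does today: the name is rebound to a new object and any other reference keeps the old value. The function name `counter_demo` is only for illustration.

```python
def counter_demo():
    """Show that += on an int rebinds the name instead of mutating the object."""
    cnt = 10**6   # large enough to stay clear of CPython's small-int cache
    alias = cnt   # second reference to the same object
    cnt += 1      # int defines no __iadd__, so this builds a new object
    return cnt, alias, cnt is alias


if __name__ == "__main__":
    print(hasattr(int, "__iadd__"))  # False
    print(counter_demo())            # (1000001, 1000000, False)
```

A true in-place update would have changed `alias` as well, which is exactly the visible side effect that any such optimization would have to rule out before mutating the object.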

I believe that by avoiding the memory allocation/deallocation for these common use cases, the performance of Python 3.x will be quite close to Python 2.7.

Much of what I’ve learned here comes from the Python trunk ‘C’ source file for the “long” object.


EDIT_ADD: Whoops, forgot that if variables are sometimes mutable, then closures on local variables don’t work or don’t work without major changes, meaning that the above in-place operations would “break” closures. It would seem that a better solution would be to get advance spare allocation working for “long”s just as it used to for short integers and does for floats, even if only for the cases where the “long” size doesn’t change (which covers the majority of the time, such as for loops and counters as per the question). Doing this should mean that the code doesn’t run that much slower than Python 2 for typical use.

Answered By: GordonBGood