Why is 2 * x * x faster than 2 * ( x * x ) in Python 3.x, for integers?

Question:

The following Python 3.x integer multiplication takes on average between 1.66s and 1.77s:

import time
start_time = time.time()
num = 0
for x in range(0, 10000000):
    # num += 2 * (x * x)
    num += 2 * x * x
print("--- %s seconds ---" % (time.time() - start_time))

if I replace 2 * x * x with 2 *(x * x), it takes between 2.04 and 2.25. How come?

On the other hand it is the opposite in Java: 2 * (x * x) is faster in Java. Java test link: Why is 2 * (i * i) faster than 2 * i * i in Java?

I ran each version of the program 10 times, here are the results.

   2 * x * x        |   2 * (x * x)
---------------------------------------
1.7717654705047607  | 2.0789272785186768
1.735931396484375   | 2.1166207790374756
1.7093875408172607  | 2.024367570877075
1.7004504203796387  | 2.047525405883789
1.6676218509674072  | 2.254328966140747
1.699510097503662   | 2.0949244499206543
1.6889283657073975  | 2.0841963291168213
1.7243537902832031  | 2.1290600299835205
1.712965488433838   | 2.1942825317382812
1.7622807025909424  | 2.1200053691864014
Asked By: Waqas Gondal

||

Answers:

If your benchmark is right (didn’t check), it may come from the fact that Python integers may be two different things : native integers when they are small (with a quick computation), and big integers when they increase in size (slower computation). The first syntax keeps the size smaller after the first operation while the second syntax may lead to two operations involving big integers.

Answered By: Thomas Baruchel

Python intern representation of integers is special, it uses slots of 30 bits :

In [6]: sys.getsizeof(2**30-1)
Out[6]: 28 # one slot + heading

In [7]: sys.getsizeof(2**30)
Out[7]: 32 # two slots 

So everything happens as if Python counts in base B = 2**30 = 1 073 741 824 ~1 billion.

For a human who want to calculate 2*4*4, two ways :

  • (2*4)*4 = 8*4 =32 = 30 + 2 is immediate if you knows your add tables.
  • 2*(4*4) = 2*16 = 2*10 + 2*6 = (2*10+10) + 2 = 30 + 2 since we have to put the operation down.

Python have the same problem. If x is a number such than 2x < B < x² , let x² = aB+b , with a,b <B. is stored in 2 slots, which I note (a|b). Computations leads to (without managing carries here):

   (x*x)*2 =>  (a|b)*2 => (2*a|2*b)
   (2*x)*x =>  (2x)*x =>(2a|2b)

in the first case the 2* operation is done two times, against only one in the first case. That explains the difference.

Answered By: B. M.

First of all, note that we don’t see the same thing in Python 2.x:

>>> timeit("for i in range(1000): 2*i*i")
51.00784397125244
>>> timeit("for i in range(1000): 2*(i*i)")
50.48330092430115

So this leads us to believe that this is due to how integers changed in Python 3: specifically, Python 3 uses long (arbitrarily large integers) everywhere.

For small enough integers (including the ones we’re considering here), CPython actually just uses the O(MN) grade-school digit by digit multiplication algorithm (for larger integers it switches to the Karatsuba algorithm). You can see this yourself in the source.

The number of digits in x*x is roughly twice that of 2*x or x (since log(x2) = 2 log(x)). Note that a “digit” in this context is not a base-10 digit, but a 30-bit value (which are treated as single digits in CPython’s implementation). Hence, 2 is a single-digit value, and x and 2*x are single-digit values for all iterations of the loop, but x*x is two-digit for x >= 2**15. Hence, for x >= 2**15, 2*x*x only requires single-by-single-digit multiplications whereas 2*(x*x) requires a single-by-single and a single-by-double-digit multiplication (since x*x has 2 30-bit digits).

Here’s a direct way to see this (Python 3):

>>> timeit("a*b", "a,b = 2, 123456**2", number=100000000)
5.796971936999967
>>> timeit("a*b", "a,b = 2*123456, 123456", number=100000000)
4.3559221399999615

Again, compare this to Python 2, which doesn’t use arbitrary-length integers everywhere:

>>> timeit("a*b", "a,b = 2, 123456**2", number=100000000)
3.0912468433380127
>>> timeit("a*b", "a,b = 2*123456, 123456", number=100000000)
3.1120400428771973

(One interesting note: If you look at the source, you’ll see that the algorithm actually has a special case for squaring numbers (which we’re doing here), but even still this is not enough to overcome the fact that 2*(x*x) just requires processing more digits.)

Answered By: arshajii

From what I can tell, it comes down to a little bit more memory access in the version using 2 * (x * x). I printed the disassembled bytecode and it seems to prove that:

Relevant part of 2 * x * x:

7          28 LOAD_FAST                1 (num)
           30 LOAD_CONST               3 (2)
           32 LOAD_FAST                2 (x)
           34 BINARY_MULTIPLY
           36 LOAD_FAST                2 (x)
           38 BINARY_MULTIPLY
           40 INPLACE_ADD
           42 STORE_FAST               1 (num)
           44 JUMP_ABSOLUTE           24

Relevant part of 2 * (x * x):

  7          28 LOAD_FAST                1 (num)
             30 LOAD_CONST               3 (2)
             32 LOAD_FAST                2 (x)
             34 LOAD_FAST                2 (x)
             36 BINARY_MULTIPLY                 <=== 1st multiply x*x in a temp value
             38 BINARY_MULTIPLY                 <=== then multiply result with 2
             40 INPLACE_ADD
             42 STORE_FAST               1 (num)
             44 JUMP_ABSOLUTE           24
Answered By: Eric Fortin