Is divmod() faster than using the % and // operators?
Question:
I remember from assembly that integer division instructions yield both the quotient and the remainder. So, in Python, will the built-in divmod() function perform better than using the % and // operators (supposing, of course, that one needs both the quotient and the remainder)?
q, r = divmod(n, d)
q, r = (n // d, n % d)
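For the record, the two spellings agree for all integers, including negatives, where Python floor-divides toward negative infinity; a quick check, written here for illustration:
# divmod(n, d) is defined to equal (n // d, n % d) for integers.
# Negative operands exercise Python's floor-division semantics.
for n in (42, -42, 7, -7, 0):
    for d in (7, -7, 3, -3):
        assert divmod(n, d) == (n // d, n % d)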
Answers:
To measure is to know (all timings on a MacBook Pro, 2.8 GHz i7):
>>> import sys, timeit
>>> sys.version_info
sys.version_info(major=2, minor=7, micro=12, releaselevel='final', serial=0)
>>> timeit.timeit('divmod(n, d)', 'n, d = 42, 7')
0.1473848819732666
>>> timeit.timeit('n // d, n % d', 'n, d = 42, 7')
0.10324406623840332
The divmod() function is at a disadvantage here because you need to look up the global each time. Binding it to a local (all variables in a timeit time trial are local) improves performance a little:
>>> timeit.timeit('dm(n, d)', 'n, d = 42, 7; dm = divmod')
0.13460898399353027
but the operators still win because they don’t have to preserve the current frame while a function call to divmod() is executed:
>>> import dis
>>> dis.dis(compile('divmod(n, d)', '', 'exec'))
1 0 LOAD_NAME 0 (divmod)
3 LOAD_NAME 1 (n)
6 LOAD_NAME 2 (d)
9 CALL_FUNCTION 2
12 POP_TOP
13 LOAD_CONST 0 (None)
16 RETURN_VALUE
>>> dis.dis(compile('(n // d, n % d)', '', 'exec'))
1 0 LOAD_NAME 0 (n)
3 LOAD_NAME 1 (d)
6 BINARY_FLOOR_DIVIDE
7 LOAD_NAME 0 (n)
10 LOAD_NAME 1 (d)
13 BINARY_MODULO
14 BUILD_TUPLE 2
17 POP_TOP
18 LOAD_CONST 0 (None)
21 RETURN_VALUE
The // and % variant uses more opcodes, but the CALL_FUNCTION bytecode is a bear, performance-wise.
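Note that the bytecode and timings above are from CPython 2.7; newer CPython releases compile calls differently and will give different numbers. A minimal sketch for rerunning the comparison on whatever interpreter you have (using timeit's default of one million repetitions):
import timeit

# Small, machine-word-sized operands; results vary by Python version.
setup = 'n, d = 42, 7'
print('divmod(n, d): ', timeit.timeit('divmod(n, d)', setup))
print('n // d, n % d:', timeit.timeit('n // d, n % d', setup))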
In PyPy, for small integers there isn’t really much of a difference; the small speed advantage the opcodes have melts away under the sheer speed of C integer arithmetic:
>>>> import platform, sys, timeit
>>>> platform.python_implementation(), sys.version_info
('PyPy', sys.version_info(major=2, minor=7, micro=10, releaselevel='final', serial=42))
>>>> timeit.timeit('divmod(n, d)', 'n, d = 42, 7', number=10**9)
0.5659301280975342
>>>> timeit.timeit('n // d, n % d', 'n, d = 42, 7', number=10**9)
0.5471200942993164
(I had to crank the number of repetitions up to 1 billion to show how small the difference really is; PyPy is blazingly fast here.)
However, when the numbers get large, divmod() wins by a country mile:
>>>> timeit.timeit('divmod(n, d)', 'n, d = 2**74207281 - 1, 26', number=100)
17.620037078857422
>>>> timeit.timeit('n // d, n % d', 'n, d = 2**74207281 - 1, 26', number=100)
34.44323515892029
(I now had to reduce the number of repetitions by a factor of 10 compared to hobbs’ numbers, just to get a result in a reasonable amount of time.)
This is because PyPy can no longer unbox those integers as C integers; you can see the striking difference in timings between using sys.maxint and sys.maxint + 1:
>>>> timeit.timeit('divmod(n, d)', 'import sys; n, d = sys.maxint, 26', number=10**7)
0.008622884750366211
>>>> timeit.timeit('n // d, n % d', 'import sys; n, d = sys.maxint, 26', number=10**7)
0.007693052291870117
>>>> timeit.timeit('divmod(n, d)', 'import sys; n, d = sys.maxint + 1, 26', number=10**7)
0.8396248817443848
>>>> timeit.timeit('n // d, n % d', 'import sys; n, d = sys.maxint + 1, 26', number=10**7)
1.0117690563201904
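On Python 3 there is no sys.maxint, so the same boundary has to be spelled out explicitly; on a 64-bit build, 2**63 - 1 is the largest value that fits a signed machine word. A sketch along those lines — the exact unboxing threshold is an implementation detail, so treat this as illustrative:
import timeit

# Probe either side of the 64-bit signed-word boundary.
for n in (2**63 - 1, 2**63):
    t = timeit.timeit('divmod(n, d)', f'n, d = {n}, 26', number=10**6)
    print(n, t)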
Martijn’s answer is correct if you’re using “small” native integers, where arithmetic operations are very fast compared to function calls. However, with bigints, it’s a whole different story:
>>> import timeit
>>> timeit.timeit('divmod(n, d)', 'n, d = 2**74207281 - 1, 26', number=1000)
24.22666597366333
>>> timeit.timeit('n // d, n % d', 'n, d = 2**74207281 - 1, 26', number=1000)
49.517399072647095
When dividing a 22-million-digit number, divmod() is almost exactly twice as fast as doing the division and modulus separately, as you might expect.
On my machine, the crossover occurs somewhere around 2^63, but don’t take my word for it. As Martijn says, measure! When performance really matters, don’t assume that what held true in one place will still be true in another.
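To locate the crossover on your own machine rather than take anyone’s word for it, something like the following works; the bit widths and repetition count here are arbitrary choices:
import timeit

# Time both variants across a range of dividend sizes; where the
# divmod() column drops below the other, that's your crossover.
for bits in (32, 63, 64, 128, 1024, 4096):
    setup = f'n, d = 2**{bits} - 1, 26'
    t_dm = timeit.timeit('divmod(n, d)', setup, number=10**5)
    t_op = timeit.timeit('n // d, n % d', setup, number=10**5)
    print(f'{bits:>5} bits: divmod {t_dm:.4f}s  //+% {t_op:.4f}s')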
I am iterating over an array pulse_onset to calculate divmod.
pulse_onset.shape
(14307,)
Using these two variants, divmod is much faster. My guess is that iterating twice over the array is worse than using divmod, but I am not familiar enough with the internals.
import time
import numpy as np

t0 = time.time()
div_array = np.array([divmod(i, 256) for i in pulse_onset])
t1 = time.time()
total = t1 - t0
print(total)
0.008155584335327148
t0 = time.time()
# Note: round(i / 256) rounds to nearest, so it is not strictly the
# floor quotient that divmod() returns; i // 256 would match exactly.
pulse_onset_int = np.array([round(i / 256) for i in pulse_onset])
remainder = np.array([i % 256 for i in pulse_onset])
t1 = time.time()
total = t1 - t0
print(total)
0.5383238792419434
Doing something like
np.array([(round(i / 256), i % 256) for i in pulse_onset])
shaves a little off the time (0.39537501335144043), so the bigger cost may be doing the division twice rather than iterating over the array twice.
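As an aside, the Python-level loop can be avoided entirely: NumPy’s np.divmod (available since NumPy 1.13) and the // and % operators all work elementwise on arrays. A sketch, with pulse_onset stood in by a synthetic integer array since the original data isn’t shown:
import numpy as np

# Stand-in for the real data; the original pulse_onset isn't shown.
pulse_onset = np.arange(14307)

# One vectorized call yields both arrays, with no Python-level loop.
q, r = np.divmod(pulse_onset, 256)

# It matches the two separate vectorized operators exactly.
assert np.array_equal(q, pulse_onset // 256)
assert np.array_equal(r, pulse_onset % 256)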