Why is x**3 slower than x*x*x?
Question:
In NumPy, x*x*x is an order of magnitude faster than x**3 or even np.power(x, 3).
x = np.random.rand(10**6)
%timeit x**3
100 loops, best of 3: 7.07 ms per loop
%timeit x*x*x
10000 loops, best of 3: 163 µs per loop
%timeit np.power(x, 3)
100 loops, best of 3: 7.15 ms per loop
Any ideas as to why this behavior happens? As far as I can tell all three yield the same output (checked with np.allclose).
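The equivalence the question mentions is easy to confirm along the same lines (a quick sanity check, not part of the original timings):

```python
import numpy as np

x = np.random.rand(10**6)

# All three expressions compute the same values, up to floating-point rounding.
a = x**3
b = x * x * x
c = np.power(x, 3)

print(np.allclose(a, b) and np.allclose(a, c))  # → True
```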
Answers:
This is because powers in Python are performed as floating-point operations (this is true for NumPy as well, because it is implemented in C).
In C, the math library provides three pow functions:
double pow(double x, double y);
long double powl(long double x, long double y);
float powf(float x, float y);
Each of these is a floating-point operation.
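As a minimal Python-side illustration: for float operands, CPython's ** delegates to this same C pow, which math.pow also wraps, so the two agree exactly:

```python
import math

# For float operands, ** and math.pow (a thin wrapper over C's pow)
# go through the same generic floating-point power routine.
print(math.pow(2.0, 3.0) == 2.0**3.0)  # → True
print(math.pow(2.0, 0.5) == 2.0**0.5)  # → True
```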
I expect it is because x**y must handle the generic case where both x and y are floats. Mathematically we can write x**y = exp(y*log(x)). Following your example I find
x = np.random.rand(10**6)
%timeit x**3
10 loops, best of 3: 178 ms per loop
%timeit np.exp(3*np.log(x))
10 loops, best of 3: 176 ms per loop
I have not checked the actual numpy code but it must be doing something like this internally.
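The identity itself is easy to verify numerically; the exp/log round-trip agrees with ** to within rounding error:

```python
import numpy as np

x = np.random.rand(10**6)

# x**y == exp(y * log(x)) mathematically; in floating point the two
# differ only by rounding error, which np.allclose tolerates.
print(np.allclose(x**3, np.exp(3 * np.log(x))))  # → True
```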
As per this answer, it’s because the implementation of exponentiation has some overhead that multiplication does not. However, naive multiplication will get slower and slower as the exponent increases. An empirical demonstration:
In [3]: x = np.random.rand(10**6)
In [15]: %timeit x**2
100 loops, best of 3: 11.9 ms per loop
In [16]: %timeit x*x
100 loops, best of 3: 12.7 ms per loop
In [17]: %timeit x**3
10 loops, best of 3: 132 ms per loop
In [18]: %timeit x*x*x
10 loops, best of 3: 27.2 ms per loop
In [19]: %timeit x**4
10 loops, best of 3: 132 ms per loop
In [20]: %timeit x*x*x*x
10 loops, best of 3: 42.4 ms per loop
In [21]: %timeit x**10
10 loops, best of 3: 132 ms per loop
In [22]: %timeit x*x*x*x*x*x*x*x*x*x
10 loops, best of 3: 137 ms per loop
In [24]: %timeit x**15
10 loops, best of 3: 132 ms per loop
In [25]: %timeit x*x*x*x*x*x*x*x*x*x*x*x*x*x*x
1 loops, best of 3: 212 ms per loop
Note the exponentiation time stays more or less constant, except for the x**2 case, which I suspect is special-cased, while multiplication gets slower and slower. It seems you could exploit this to get faster integer exponentiation… for example:
In [26]: %timeit x**16
10 loops, best of 3: 132 ms per loop
In [27]: %timeit x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x
1 loops, best of 3: 225 ms per loop
In [28]: def tosixteenth(x):
....: x2 = x*x
....: x4 = x2*x2
....: x8 = x4*x4
....: x16 = x8*x8
....: return x16
....:
In [29]: %timeit tosixteenth(x)
10 loops, best of 3: 49.5 ms per loop
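The four-squaring shortcut is exact, not an approximation (restating the function above so the check is self-contained):

```python
import numpy as np

def tosixteenth(x):
    # x**16 via four repeated squarings: 4 multiplies instead of 15.
    x2 = x * x
    x4 = x2 * x2
    x8 = x4 * x4
    return x8 * x8

x = np.random.rand(10**5)
print(np.allclose(tosixteenth(x), x**16))  # → True
```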
It seems you could apply this technique generically by splitting any integer exponent into a sum of powers of two, computing each power-of-two factor as above, and multiplying the factors together:
In [93]: %paste
def smartintexp(x, exp):
    result = np.ones(len(x))
    curexp = np.array(x)
    while True:
        if exp % 2 == 1:
            result *= curexp
        exp >>= 1
        if not exp:
            break
        curexp *= curexp
    return result
## -- End pasted text --
In [94]: x
Out[94]:
array([ 0.0163407 , 0.57694587, 0.47336487, ..., 0.70255032,
0.62043303, 0.0796748 ])
In [99]: x**21
Out[99]:
array([ 3.01080670e-38, 9.63466181e-06, 1.51048544e-07, ...,
6.02873388e-04, 4.43193256e-05, 8.46721060e-24])
In [100]: smartintexp(x, 21)
Out[100]:
array([ 3.01080670e-38, 9.63466181e-06, 1.51048544e-07, ...,
6.02873388e-04, 4.43193256e-05, 8.46721060e-24])
In [101]: %timeit x**21
10 loops, best of 3: 132 ms per loop
In [102]: %timeit smartintexp(x, 21)
10 loops, best of 3: 70.7 ms per loop
It’s fast when the exponent is a small power of two:
In [106]: %timeit x**32
10 loops, best of 3: 131 ms per loop
In [107]: %timeit smartintexp(x, 32)
10 loops, best of 3: 57.4 ms per loop
But gets slower as the exponent gets larger:
In [97]: %timeit x**63
10 loops, best of 3: 133 ms per loop
In [98]: %timeit smartintexp(x, 63)
10 loops, best of 3: 110 ms per loop
And not faster for large worst-cases:
In [115]: %timeit x**511
10 loops, best of 3: 135 ms per loop
In [114]: %timeit smartintexp(x, 511)
10 loops, best of 3: 192 ms per loop
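The slowdown tracks the number of full-array multiplications performed: repeated squaring needs one squaring per bit position below the exponent's top bit, plus one result-multiply per set bit. A back-of-the-envelope count (mult_count is a helper name of my own, not from the answer):

```python
def mult_count(exp):
    # Multiplications done by smartintexp above: one squaring for each
    # bit position below the top bit, plus one result-multiply per set bit.
    return bin(exp).count("1") + exp.bit_length() - 1

print(mult_count(16))   # → 5  (vs 15 for naive x*x*...*x)
print(mult_count(63))   # → 11 (six set bits, so many result-multiplies)
print(mult_count(511))  # → 17 (nine set bits: the worst case per bit-length)
```

For x**511 that is 17 passes over the whole array, which is why smartintexp loses to the single vectorized pow call there.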
timeit np.multiply(np.multiply(x,x),x) times the same as x*x*x. My guess was that np.multiply is using a fast Fortran linear algebra package such as BLAS. I know from another issue that numpy.dot uses BLAS for certain cases.
I have to take that back. np.dot(x,x) is 3x faster than np.sum(x*x), so the speed advantage of np.multiply is not consistent with using BLAS.
With my numpy (times will vary with machine and available libraries),
np.power(x, 3.1)
np.exp(3.1*np.log(x))
take about the same time, but np.power(x, 3) is 2x as fast. Not as fast as x*x*x, but still faster than the general power. So it is taking some advantage of the integer exponent.
As a note, if you are calculating powers and are worried about speed:
x = np.random.rand(5 * 10**7)
%timeit x*x*x
1 loops, best of 3: 522 ms per loop
%timeit np.einsum('i,i,i->i',x,x,x)
1 loops, best of 3: 288 ms per loop
Why einsum is faster is still an open question of mine, although it is likely because einsum can use SSE2, while numpy’s ufuncs will not until 1.8.
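The einsum form computes exactly the element-wise triple product (no summation occurs, since the output keeps the index i), which is easy to confirm:

```python
import numpy as np

x = np.random.rand(10**5)

# 'i,i,i->i' multiplies the three operands element-wise with no summation,
# so it is equivalent to x*x*x.
print(np.allclose(np.einsum('i,i,i->i', x, x, x), x * x * x))  # → True
```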
In place is even faster:
from numba import autojit  # numba's original API; newer releases use @jit

def calc_power(arr):
    # cube each element in place
    for i in xrange(arr.shape[0]):
        arr[i] = arr[i] * arr[i] * arr[i]

numba_power = autojit(calc_power)
%timeit numba_power(x)
10 loops, best of 3: 51.5 ms per loop
%timeit np.einsum('i,i,i->i',x,x,x,out=x)
10 loops, best of 3: 111 ms per loop
%timeit np.power(x,3,out=x)
1 loops, best of 3: 609 ms per loop
According to the spec:
The two-argument form pow(x, y) is equivalent to using the power
operator: x**y.
The arguments must have numeric types. With mixed operand types, the
coercion rules for binary arithmetic operators apply.
In other words: since x is a float, the exponent is converted from an int to a float, and the generic floating-point power operation is performed. Internally, this is usually rewritten as x**y = 2**(y*lg(x)). 2**a and lg a (the base-2 logarithm of a) are single instructions on modern processors, but it still takes much longer than a couple of multiplications.
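The base-2 form is the same identity as the exp/log one above, just in a different base; a quick numerical check (using NumPy for convenience):

```python
import numpy as np

x = np.random.rand(10**5)
y = 3.0

# x**y == 2**(y * log2(x)), up to floating-point rounding.
print(np.allclose(x**y, 2.0 ** (y * np.log2(x))))  # → True
```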