Why is the float * int multiplication faster than int * float in CPython?
Question:
Basically, the expression 0.4 * a is consistently, and surprisingly, significantly faster than a * 0.4, with a being an integer. And I have no idea why.
I speculated that it is a case of the LOAD_CONST LOAD_FAST bytecode pair being "more specialized" than LOAD_FAST LOAD_CONST, and I would be entirely satisfied with this explanation, except that this quirk seems to apply only to multiplications where the types of the multiplied operands differ. (By the way, I can no longer find the link to this "bytecode instruction pair popularity ranking" I once found on GitHub – does anyone have a link?)
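As a quick sanity check on the bytecode theory (my addition, not part of the original question), the dis module shows that both orderings compile to the same instructions with only the two loads swapped, so the bytecode itself doesn't favor either order:

```python
import dis

def int_first(a):
    return a * 0.4

def float_first(a):
    return 0.4 * a

# Both functions use the same set of opcodes; only the order of the
# LOAD_FAST / LOAD_CONST pair differs.
ops1 = [i.opname for i in dis.get_instructions(int_first)]
ops2 = [i.opname for i in dis.get_instructions(float_first)]
print(ops1)
print(ops2)
```

(The exact opcode names vary by Python version – 3.11+ uses BINARY_OP instead of BINARY_MULTIPLY – but the symmetry holds.)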
Anyway, here are the micro benchmarks:
$ python3.10 -m pyperf timeit -s"a = 9" "a * 0.4"
Mean +- std dev: 34.2 ns +- 0.2 ns
$ python3.10 -m pyperf timeit -s"a = 9" "0.4 * a"
Mean +- std dev: 30.8 ns +- 0.1 ns
$ python3.10 -m pyperf timeit -s"a = 0.4" "a * 9"
Mean +- std dev: 30.3 ns +- 0.3 ns
$ python3.10 -m pyperf timeit -s"a = 0.4" "9 * a"
Mean +- std dev: 33.6 ns +- 0.3 ns
As you can see, the runs where the float comes first (the 2nd and 3rd) are faster.
So my question is: where does this behavior come from? I’m 90% sure that it is an implementation detail of CPython, but I’m not familiar enough with low-level instructions to state that for sure.
Answers:
It’s CPython’s implementation of the BINARY_MULTIPLY opcode. It has no idea what the types are at compile time, so everything has to be figured out at run time. Regardless of what a and b may be, BINARY_MULTIPLY ends up invoking a.__mul__(b).
When a is of int type, int.__mul__(a, b) has no idea what to do unless b is also of int type. It returns NotImplemented (via the internal C macro Py_RETURN_NOTIMPLEMENTED); the check lives in longobject.c's CHECK_BINOP macro. The interpreter sees that, and effectively says "OK, a.__mul__ has no idea what to do, so let’s give b.__rmul__ a shot at it". None of that is free – it all takes time.
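You can observe the refusal directly from Python, where the C macro surfaces as the NotImplemented singleton:

```python
# int.__mul__ bails out when the right operand is a float...
print((9).__mul__(0.4))    # NotImplemented
# ...so the interpreter has to try the float's __rmul__ instead.
print((0.4).__rmul__(9))
```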
float.__mul__(b, a) (which behaves the same as float.__rmul__) does know what to do with an int (it converts it to a float first), so that succeeds.
But when a is of float type to begin with, we go to float.__mul__ first, and that’s the end of it. No time is burned figuring out that the int type doesn’t know what to do.
The actual code is quite a bit more involved than the above pretends, but that’s the gist of it.