Why does this specific code run faster in Python 3.11?

Question:

I have the following code in a Python file called benchmark.py.

source = """
for i in range(1000):
    a = len(str(i))
"""

import timeit

print(timeit.timeit(stmt=source, number=100000))

When I ran it with several Python versions, I saw a drastic performance difference.

C:\Users\Username\Desktop>py -3.10 benchmark.py
16.79652149998583

C:\Users\Username\Desktop>py -3.11 benchmark.py
10.92280820000451

As you can see, this code runs faster with Python 3.11 than with previous Python versions. I disassembled the bytecode to understand the reason for this behaviour, but the only difference I could see was in the opcode names (CALL_FUNCTION is replaced by the PRECALL and CALL opcodes).

I am not quite sure whether that is the reason for this performance change, so I am looking for an answer that justifies it with reference to the CPython source code.

python 3.11 bytecode

  0           0 RESUME                   0

  2           2 PUSH_NULL
              4 LOAD_NAME                0 (range)
              6 LOAD_CONST               0 (1000)
              8 PRECALL                  1
             12 CALL                     1
             22 GET_ITER
        >>   24 FOR_ITER                22 (to 70)
             26 STORE_NAME               1 (i)

  3          28 PUSH_NULL
             30 LOAD_NAME                2 (len)
             32 PUSH_NULL
             34 LOAD_NAME                3 (str)
             36 LOAD_NAME                1 (i)
             38 PRECALL                  1
             42 CALL                     1
             52 PRECALL                  1
             56 CALL                     1
             66 STORE_NAME               4 (a)
             68 JUMP_BACKWARD           23 (to 24)

  2     >>   70 LOAD_CONST               1 (None)
             72 RETURN_VALUE

python 3.10 bytecode

  2           0 LOAD_NAME                0 (range)
              2 LOAD_CONST               0 (1000)
              4 CALL_FUNCTION            1
              6 GET_ITER
        >>    8 FOR_ITER                 8 (to 26)
             10 STORE_NAME               1 (i)

  3          12 LOAD_NAME                2 (len)
             14 LOAD_NAME                3 (str)
             16 LOAD_NAME                1 (i)
             18 CALL_FUNCTION            1
             20 CALL_FUNCTION            1
             22 STORE_NAME               4 (a)
             24 JUMP_ABSOLUTE            4 (to 8)

  2     >>   26 LOAD_CONST               1 (None)
             28 RETURN_VALUE

PS: I understand that Python 3.11 introduced a bunch of performance improvements, but I am curious to understand which optimization makes this code run faster in Python 3.11.

Asked By: Abdul Niyas P M


Answers:

There’s a big section in the "What’s New in Python 3.11" page labeled "Faster Runtime". The most likely cause of the speedup here is PEP 659, the specializing adaptive interpreter, which is a first step towards JIT optimization (perhaps not quite JIT compilation, but definitely JIT optimization).

In particular, the lookup and call for len and str now bypass a lot of dynamic machinery in the overwhelmingly common case where the built-ins aren’t shadowed or overridden. The global and builtin dict lookups to resolve the name get skipped in a fast path, and the underlying C routines for len and str are called directly, instead of going through the general-purpose function call handling.
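If you want to see the specialization for yourself, here is a minimal sketch. It assumes the loop is placed inside a function (as far as I know, timeit also wraps the statement in a function, so the names are resolved with LOAD_GLOBAL rather than the LOAD_NAME shown in the question's disassembly). The helper name opnames and the warm-up count of 100 are my own choices, not anything from CPython. The adaptive flag of the dis module only exists on 3.11 and later, so the sketch guards for that; the exact names of the specialized opcodes differ between Python versions, so no expected output is shown.

```python
import dis
import sys


def loop():
    for i in range(1000):
        a = len(str(i))


def opnames(func, warmup=100):
    """Run func `warmup` times, then return the names of its opcodes.

    `opnames` and the warm-up count are illustrative choices.
    """
    for _ in range(warmup):
        func()
    kwargs = {}
    if sys.version_info >= (3, 11):
        # `adaptive=True` (3.11+) makes dis report the specialized opcodes
        # that the interpreter has swapped in at run time.
        kwargs["adaptive"] = True
    return [ins.opname for ins in dis.get_instructions(func, **kwargs)]


names = opnames(loop)
print(sorted(set(names)))
```

On 3.11 or later, the printed set should contain specialized variants of the call and global-load opcodes in place of the generic ones; on 3.10 it just shows the plain opcodes.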

You wanted source references, so here’s one. The str call will get specialized in specialize_class_call:

    if (tp->tp_flags & Py_TPFLAGS_IMMUTABLETYPE) {
        if (nargs == 1 && kwnames == NULL && oparg == 1) {
            if (tp == &PyUnicode_Type) {
                _Py_SET_OPCODE(*instr, PRECALL_NO_KW_STR_1);
                return 0;
            }

where it detects that the call is a call to the str builtin with 1 positional argument and no keywords, and replaces the corresponding PRECALL opcode with PRECALL_NO_KW_STR_1. The handling for the PRECALL_NO_KW_STR_1 opcode in the bytecode evaluation loop looks like this:

        TARGET(PRECALL_NO_KW_STR_1) {
            assert(call_shape.kwnames == NULL);
            assert(cframe.use_tracing == 0);
            assert(oparg == 1);
            DEOPT_IF(is_method(stack_pointer, 1), PRECALL);
            PyObject *callable = PEEK(2);
            DEOPT_IF(callable != (PyObject *)&PyUnicode_Type, PRECALL);
            STAT_INC(PRECALL, hit);
            SKIP_CALL();
            PyObject *arg = TOP();
            PyObject *res = PyObject_Str(arg);
            Py_DECREF(arg);
            Py_DECREF(&PyUnicode_Type);
            STACK_SHRINK(2);
            SET_TOP(res);
            if (res == NULL) {
                goto error;
            }
            CHECK_EVAL_BREAKER();
            DISPATCH();
        }

which consists mostly of a bunch of safety prechecks and reference fiddling wrapped around a call to PyObject_Str, the C routine for calling str on an object.
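The shape of that handler (check a cheap guard, take the fast path if it holds, otherwise fall back to the generic opcode) can be modelled in a few lines of Python. This is a toy illustration only: call_str_1, generic_call, and path_counts are hypothetical names I made up, and the real logic lives in the C evaluation loop shown above.

```python
# Toy model only: the real logic is the C handler, not this Python code.
path_counts = {"fast": 0, "generic": 0}


def generic_call(func, arg):
    # Stand-in for the general-purpose call machinery (hypothetical helper).
    path_counts["generic"] += 1
    return func(arg)


def call_str_1(func, arg):
    # Guard, in the spirit of DEOPT_IF(callable != &PyUnicode_Type, PRECALL):
    # if the callable is not the real str type, use the generic path.
    if func is not str:
        return generic_call(func, arg)
    # Fast path, analogous to calling PyObject_Str directly.
    path_counts["fast"] += 1
    return str(arg)


print(call_str_1(str, 42))    # guard passes, fast path
print(call_str_1(repr, "x"))  # guard fails, generic path
print(path_counts)
```

The point of the design is that the guard is a single pointer comparison, so the common case pays almost nothing for the safety check.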

Python 3.11 includes many other performance enhancements besides the above, including optimizations to stack frame creation, method lookup, common arithmetic operations, interpreter startup, and more. The "What’s New" page reports an average speedup of about 25% over 3.10 on the pyperformance benchmark suite, with gains of roughly 10-60% depending on the workload. The main exceptions are I/O-bound workloads and code that spends most of its time in C library code (like NumPy).

Answered By: user2357112