Python 3.11 worse optimized than 3.10?

Question:

I run this simple loop with Python 3.10.7 and 3.11.0 on Windows 10.

import time
a = 'a'

start = time.time()
for _ in range(1000000):
    a += 'a'
end = time.time()

print(a[:5], (end-start) * 1000)

The older version executes in 187ms, Python 3.11 needs about 17000ms. Does 3.10 realize that only the first 5 chars of a are needed, whereas 3.11 executes the whole loop? I confirmed this performance difference on godbolt.

Asked By: Michael

||

Answers:

TL;DR: you should not use such a loop in any performance critical code but ''.join instead. The inefficient execution appears to be related to a regression during the bytecode generation in CPython 3.11 (and missing optimizations during the evaluation of binary add operation on Unicode strings).


General guidelines

This is an antipattern. You should not write such a code if you want this to be fast. This is described in PEP-8:

Code should be written in a way that does not disadvantage other implementations of Python (PyPy, Jython, IronPython, Cython, Psyco, and such).
For example, do not rely on CPython’s efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b. This optimization is fragile even in CPython (it only works for some types) and isn’t present at all in implementations that don’t use refcounting. In performance sensitive parts of the library, the ''.join() form should be used instead. This will ensure that concatenation occurs in linear time across various implementations.

Indeed, other implementations like PyPy does not perform an efficient in-place string concatenation for example. A new bigger string is created for every iteration (since strings are immutable, the previous one may be referenced and PyPy does not use a reference counting but a garbage collector). This results in a quadratic runtime as opposed to a linear runtime in CPython (at least in past implementation).


Deep Analysis

I can reproduce the problem on Windows 10 between the embedded (64-bit x86-64) version of CPython 3.10.8 and the one of 3.11.0:

Timings:
 - CPython 3.10.8:    146.4 ms
 - CPython 3.11.0:  15186.8 ms

It turns out the code has not particularly changed between CPython 3.10 and 3.11 when it comes to Unicode string appending. See for example PyUnicode_Append: 3.10 and 3.11.

A low-level profiling analysis shows that nearly all the time is spent in one unnamed function call of another unnamed function called by PyUnicode_Concat (which is also left unmodified between CPython 3.10.8 and 3.11.0). This slow unnamed function contains a pretty small set of assembly instructions and nearly all the time is spent in one unique x86-64 assembly instruction: rep movsb byte ptr [rdi], byte ptr [rsi]. This instruction is basically meant to copy a buffer pointed by the rsi register to a buffer pointed by the rdi register (the processor copy rcx bytes for the source buffer to the destination buffer and decrement the rcx register for each byte until it reach 0). This information shows that the unnamed function is actually memcpy of the standard MSVC C runtime (ie. CRT) which appears to be called by _copy_characters itself called by _PyUnicode_FastCopyCharacters of PyUnicode_Concat (all the functions are still belonging to the same file). However, these CPython functions are still left unmodified between CPython 3.10.8 and 3.11.0. The non-negligible time spent in malloc/free (about 0.3 seconds) seems to indicate that a lot of new string objects are created — certainly at least 1 per iteration — matching with the call to PyUnicode_New in the code of PyUnicode_Concat. All of this indicates that a new bigger string is created and copied as specified above.

The thing is calling PyUnicode_Concat is certainly the root of the performance issue here and I think CPython 3.10.8 is faster because it certainly calls PyUnicode_Append instead. Both calls are directly performed by the main big interpreter evaluation loop and this loop is driven by the generated bytecode.

It turns out that the generated bytecode is different between the two version and it is the root of the performance issue. Indeed, CPython 3.10 generates an INPLACE_ADD bytecode instruction while CPython 3.11 generates a BINARY_OP bytecode instruction. Here is the bytecode for the loops in the two versions:

CPython 3.10 loop:

        >>   28 FOR_ITER                 6 (to 42)
             30 STORE_NAME               4 (_)
  6          32 LOAD_NAME                1 (a)
             34 LOAD_CONST               2 ('a')
             36 INPLACE_ADD                             <----------
             38 STORE_NAME               1 (a)
             40 JUMP_ABSOLUTE           14 (to 28)

CPython 3.11 loop:

        >>   66 FOR_ITER                 7 (to 82)
             68 STORE_NAME               4 (_)
  6          70 LOAD_NAME                1 (a)
             72 LOAD_CONST               2 ('a')
             74 BINARY_OP               13 (+=)         <----------
             78 STORE_NAME               1 (a)
             80 JUMP_BACKWARD            8 (to 66)

This changes appears to come from this issue. The code of the main interpreter loop (see ceval.c) is different between the two CPython version. Here are the code executed by the two versions:

        // In CPython 3.10.8
        case TARGET(INPLACE_ADD): {
            PyObject *right = POP();
            PyObject *left = TOP();
            PyObject *sum;
            if (PyUnicode_CheckExact(left) && PyUnicode_CheckExact(right)) {
                sum = unicode_concatenate(tstate, left, right, f, next_instr); // <-----
                /* unicode_concatenate consumed the ref to left */
            }
            else {
                sum = PyNumber_InPlaceAdd(left, right);
                Py_DECREF(left);
            }
            Py_DECREF(right);
            SET_TOP(sum);
            if (sum == NULL)
                goto error;
            DISPATCH();
        }

//----------------------------------------------------------------------------

        // In CPython 3.11.0
        TARGET(BINARY_OP_ADD_UNICODE) {
            assert(cframe.use_tracing == 0);
            PyObject *left = SECOND();
            PyObject *right = TOP();
            DEOPT_IF(!PyUnicode_CheckExact(left), BINARY_OP);
            DEOPT_IF(Py_TYPE(right) != Py_TYPE(left), BINARY_OP);
            STAT_INC(BINARY_OP, hit);
            PyObject *res = PyUnicode_Concat(left, right); // <-----
            STACK_SHRINK(1);
            SET_TOP(res);
            _Py_DECREF_SPECIALIZED(left, _PyUnicode_ExactDealloc);
            _Py_DECREF_SPECIALIZED(right, _PyUnicode_ExactDealloc);
            if (TOP() == NULL) {
                goto error;
            }
            JUMPBY(INLINE_CACHE_ENTRIES_BINARY_OP);
            DISPATCH();
        }

Note that unicode_concatenate calls PyUnicode_Append (and do some reference counting checks before). In the end, CPython 3.10.8 calls PyUnicode_Append which is fast (in-place) and CPython 3.11.0 calls PyUnicode_Concat which is slow (out-of-place). It clearly looks like a regression to me.

People in the comments reported having no performance issue on Linux. However, experimental tests shows a BINARY_OP instruction is also generated on Linux, and I cannot find so far any Linux-specific optimization regarding string concatenation. Thus, the difference between the platforms is pretty surprising.


Update: towards a fix

I have opened an issue about this available here. One should not that putting the code in a function is significantly faster due to the variable being local (as pointed out by @Dennis in the comments).


Related posts:

Answered By: Jérôme Richard

As mentioned in the other answer, this is indeed a regression but it will be fixed in Python 3.12, from the GitHub issue:

FTR, the linear time behavior will be restored for globals (and nonlocals) in 3.12, as a side-effect of the register VM.

Answered By: Caridorc