Correct way to append to a string in Python

Question:

I’ve read this reply which explains that CPython has an optimization to do an in-place append without copy when appending to a string using a = a + b or a += b. I’ve also read this PEP8 recommendation:

Code should be written in a way that does not disadvantage other
implementations of Python (PyPy, Jython, IronPython, Cython, Psyco,
and such). For example, do not rely on CPython’s efficient
implementation of in-place string concatenation for statements in the
form a += b or a = a + b. This optimization is fragile even in CPython
(it only works for some types) and isn’t present at all in
implementations that don’t use refcounting. In performance sensitive
parts of the library, the ''.join() form should be used instead. This
will ensure that concatenation occurs in linear time across various
implementations.

So if I understand correctly, instead of writing a += b + c and relying on this CPython optimization to do the append in place, the proper way is to call a = ''.join([a, b, c])?

But then why is the join form significantly slower than the += form in this example? (In loop1 I'm using a = a + b + c on purpose so that the CPython optimization is not triggered.)

import time

if __name__ == "__main__":
    start_time = time.time()
    print("begin: %s " % (start_time))
    s = ""
    for i in range(100000):
        s = s + str(i) + '3'
    time1 = time.time()
    print("end loop1: %s " % (time1 - start_time))

    s2 = ""
    for i in range(100000):
        s2 += str(i) + '3'

    time2 = time.time()
    print("end loop2: %s " % (time2 - time1))

    s3 = ""
    for i in range(100000):
        s3 = ''.join([s3, str(i), '3'])

    time3 = time.time()
    print("end loop3: %s " % (time3 - time2))

The results show join is significantly slower in this case:

~/testdir$ python --version
Python 3.10.6
~/testdir$ python concatenate.py 
begin: 1675268345.0761461 
end loop1: 3.9019 
end loop2: 0.0260 
end loop3: 0.9289 

Is my version with join wrong?

Asked By: Étienne

||

Answers:

In "loop3" you lose most of the benefit of join() by calling it on every iteration: each call copies the whole accumulated string again. It is better to build up the full list of pieces and then call join() once at the end.
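
To make that cost concrete, here is a rough sketch that counts how many characters get written under each strategy. The function names are made up for illustration, and the model simply assumes that every join() writes a fresh copy of its full result; it ignores allocator details.

```python
def chars_written_repeated_join(pieces):
    """Characters written if the accumulator is re-joined for every piece."""
    total = 0
    length = 0
    for piece in pieces:
        length += len(piece)
        # s = ''.join([s, piece]) builds a brand-new string of this length
        total += length
    return total


def chars_written_single_join(pieces):
    """Characters written if all pieces are joined once at the end."""
    return sum(len(piece) for piece in pieces)


if __name__ == "__main__":
    pieces = [".3"] * 1000
    print("repeated join:", chars_written_repeated_join(pieces))
    print("single join:  ", chars_written_single_join(pieces))
```

Under this model the repeated form grows with the square of the number of pieces, while the single join grows linearly.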

Check out:

import time

iterations = 100_000

##----------------
s = ""
start_time = time.time()
for i in range(iterations):
    s = s + "." + '3'
end_time = time.time()
print("end loop1: %s " % (end_time - start_time))
##----------------

##----------------
s = ""
start_time = time.time()
for i in range(iterations):
    s += "." + '3'
end_time = time.time()
print("end loop2: %s " % (end_time - start_time))
##----------------

##----------------
s = ""
start_time = time.time()
for i in range(iterations):
    s = ''.join([s, ".", '3'])
end_time = time.time()
print("end loop3: %s " % (end_time - start_time))
##----------------

##----------------
s = []
start_time = time.time()
for i in range(iterations):
    s.append(".")
    s.append("3")
s = "".join(s)
end_time = time.time()
print("end loop4: %s " % (end_time - start_time))
##----------------

##----------------
s = []
start_time = time.time()
for i in range(iterations):
    s.extend((".", "3"))
s = "".join(s)
end_time = time.time()
print("end loop5: %s " % (end_time - start_time))
##----------------

You can also run this with more iterations, for example:

iterations = 10_000_000

If you do, be sure to remove "loop1" and "loop3", as they get dramatically slower after about 300k iterations.

When I run this with 10 million iterations I see:

end loop2: 16.977502584457397 
end loop4: 1.6301295757293701 
end loop5: 1.0435805320739746

So, clearly there is a way to use join() that is fast 🙂
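
Applied to the loop from the question (str(i) followed by '3'), the same idea might look like the sketch below. The function names are mine, and I have not timed this variant; it only shows that a single join() over a generator produces the same string as the += loop.

```python
def build_with_inplace_add(n):
    """The += form from the question."""
    s = ""
    for i in range(n):
        s += str(i) + "3"
    return s


def build_with_single_join(n):
    """Generate all the pieces, then join once."""
    return "".join(f"{i}3" for i in range(n))


if __name__ == "__main__":
    # Both forms should build exactly the same string.
    print(build_with_inplace_add(5) == build_with_single_join(5))
```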

ADDENDUM:

@Étienne has suggested that making the string to append longer reverses the findings, and that the optimization of loop2 does not happen unless it is in a function. I do not see the same.

import time

iterations = 10_000_000
string_to_append = "345678912"

def loop2(iterations):
    s = ""
    for i in range(iterations):
        s += "." + string_to_append
    return s

def loop4(iterations):
    s = []
    for i in range(iterations):
        s.append(".")
        s.append(string_to_append)
    return "".join(s)

def loop5(iterations):
    s = []
    for i in range(iterations):
        s.extend((".", string_to_append))
    return "".join(s)

##----------------
start_time = time.time()
s = loop2(iterations)
end_time = time.time()
print("end loop2: %s " % (end_time - start_time))
##----------------

##----------------
start_time = time.time()
s = loop4(iterations)
end_time = time.time()
print("end loop4: %s " % (end_time - start_time))
##----------------

##----------------
start_time = time.time()
s = loop5(iterations)
end_time = time.time()
print("end loop5: %s " % (end_time - start_time))
##----------------

On Python 3.10 and 3.11 the results are similar. I get results like the following:

end loop2: 336.98531889915466 
end loop4: 1.0211727619171143 
end loop5: 1.1640543937683105

that continue to suggest to me that join() is overwhelmingly faster.
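
If you want to repeat the measurement with less noise than a single time.time() difference, one option is the standard library's timeit module. This is only a sketch: the helper best_time() and the two function names are my own, and the iteration count is deliberately small, so raise it to reproduce the numbers above.

```python
import timeit

STRING_TO_APPEND = "345678912"


def concat_in_place(iterations):
    """The += form (loop2)."""
    s = ""
    for _ in range(iterations):
        s += "." + STRING_TO_APPEND
    return s


def join_once(iterations):
    """Collect the pieces in a list and join once (loop5)."""
    parts = []
    for _ in range(iterations):
        parts.extend((".", STRING_TO_APPEND))
    return "".join(parts)


def best_time(func, iterations, repeat=3):
    """Best wall-clock time in seconds over `repeat` single runs."""
    return min(timeit.repeat(lambda: func(iterations), number=1, repeat=repeat))


if __name__ == "__main__":
    n = 100_000
    for func in (concat_in_place, join_once):
        print(f"{func.__name__}: {best_time(func, n):.4f} s")
```

Taking the minimum of a few runs is a common way to reduce the influence of whatever else the machine is doing.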

Answered By: JonSG

This just adds results for @JonSG's answer from the different Python implementations I have available. I am posting it as an answer because I cannot use formatting in a comment.

The only modification is that I used 1M iterations and, for the "local" columns, wrapped the whole test in a test() function. Running it inside an 'if __name__ == "__main__":' block doesn't seem to help with the 3.11 regression Étienne mentioned. With 3.12.0a5 I'm seeing a similar difference between a local and a global s variable, but it's a lot faster.

loop                      3.10.10  3.10.10  3.11.2  3.11.2  3.12.0a5  3.12.0a5  pypy 3.9.16  pypy 3.9.16
                          global   local    global  local   global    local     global       local
a = a + b + c             71.04    71.76    92.55   90.57   91.24     92.08     120.05       97.94
a += b + c                0.38     0.20     26.57   0.21    24.06     0.03      108.98       89.62
a = ''.join([a, b, c])    23.26    21.96    25.31   24.60   23.94     23.79     94.04        90.88
a.append(b);a.append(c)   0.50     0.38     0.35    0.23    0.0692    0.0334    0.12         0.12
a.extend((b, c))          0.35     0.27     0.29    0.19    0.0684    0.0343    0.10         0.10

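One visible difference between the "global" and "local" cases is the bytecode instruction used to store s. The sketch below uses the dis module to list the store instructions for a module-level += and for the same statement inside a function (local_version is a made-up name). It does not prove which stores the interpreter optimizes; that is a CPython implementation detail that can change between versions.

```python
import dis


def store_opnames(code):
    """Names of all STORE_* instructions found in a code object."""
    return {
        ins.opname
        for ins in dis.get_instructions(code)
        if ins.opname.startswith("STORE_")
    }


# Module-level version: compiled only, never executed.
module_code = compile("s = ''\ns += piece\n", "<module-level>", "exec")


def local_version(piece):
    # Same statements, but s is a local variable here.
    s = ""
    s += piece
    return s


if __name__ == "__main__":
    print("module level:", sorted(store_opnames(module_code)))
    print("in function: ", sorted(store_opnames(local_version.__code__)))
```
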
Answered By: JaMa