Why is variable1 += variable2 much faster than variable1 = variable1 + variable2?

Question:

I have inherited some Python code which is used to create huge tables (of up to 19 columns wide by 5000 rows). It took nine seconds for the table to be drawn on the screen. I noticed that each row was added using this code:

sTable = sTable + 'n' + GetRow()

where sTable is a string.

I changed that to:

sTable += 'n' + GetRow()

and I noticed that the table now appeared in six seconds.

And then I changed it to:

sTable += 'n%s' % GetRow()

based on these Python performance tips (still six seconds).

Since this was called about 5000 times, it highlighted the performance issue. But why was there such a large difference? And why didn’t the compiler spot the problem in the first version and optimise it?

Answers:

This isn’t about using inplace += versus + binary add. You didn’t tell us the whole story. Your original version concatenated 3 strings, not just two:

sTable = sTable + 'n' + sRow  # simplified, sRow is a function call

Python tries to help out and optimises string concatenation; both when using strobj += otherstrobj and strobj = strobj + otherstringobj, but it cannot apply this optimisation when more than 2 strings are involved.

Python strings are immutable normally, but if there are no other references to the left-hand string object and it is being rebound anyway, then Python cheats and mutates the string. This avoids having to create a new string each time you concatenate, and that can lead to a big speed improvement.

This is implemented in the bytecode evaluation loop. Both when using BINARY_ADD on two strings and when using INPLACE_ADD on two strings, Python delegates concatenation to a special helper function string_concatenate(). To be able to optimize the concatenation by mutating the string, it first needs to make sure that the string has no other references to it; if only the stack and the original variable reference it then this can be done, and the next operation is going to replace the original variable reference.

So if there are just 2 references to the string, and the next operator is one of STORE_FAST (set a local variable), STORE_DEREF (set a variable referenced by closed over functions) or STORE_NAME (set a global variable), and the affected variable currently references the same string, then that target variable is cleared to reduce the number of references to just 1, the stack.

And this is why your original code could not use this optimization fully. The first part of your expression is sTable + 'n' and the next operation is another BINARY_ADD:

>>> import dis
>>> dis.dis(compile(r"sTable = sTable + 'n' + sRow", '<stdin>', 'exec'))
  1           0 LOAD_NAME                0 (sTable)
              3 LOAD_CONST               0 ('n')
              6 BINARY_ADD          
              7 LOAD_NAME                1 (sRow)
             10 BINARY_ADD          
             11 STORE_NAME               0 (sTable)
             14 LOAD_CONST               1 (None)
             17 RETURN_VALUE        

The first BINARY_ADD is followed by a LOAD_NAME to access the sRow variable, not a store operation. This first BINARY_ADD must always result in a new string object, ever larger as sTable grows and it takes more and more time to create this new string object.

You changed this code to:

sTable += 'n%s' % sRow

which removed the second concatenation. Now the bytecode is:

>>> dis.dis(compile(r"sTable += 'n%s' % sRow", '<stdin>', 'exec'))
  1           0 LOAD_NAME                0 (sTable)
              3 LOAD_CONST               0 ('n%s')
              6 LOAD_NAME                1 (sRow)
              9 BINARY_MODULO       
             10 INPLACE_ADD         
             11 STORE_NAME               0 (sTable)
             14 LOAD_CONST               1 (None)
             17 RETURN_VALUE        

and all we have left is an INPLACE_ADD followed by a store. Now sTable can be altered in-place, not resulting in a ever larger new string object.

You’d have gotten the same speed difference with:

sTable = sTable + ('n%s' % sRow)

here.

A time trial shows the difference:

>>> import random
>>> from timeit import timeit
>>> testlist = [''.join([chr(random.randint(48, 127)) for _ in range(random.randrange(10, 30))]) for _ in range(1000)]
>>> def str_threevalue_concat(lst):
...     res = ''
...     for elem in lst:
...         res = res + 'n' + elem
... 
>>> def str_twovalue_concat(lst):
...     res = ''
...     for elem in lst:
...         res = res + ('n%s' % elem)
... 
>>> timeit('f(l)', 'from __main__ import testlist as l, str_threevalue_concat as f', number=10000)
6.196403980255127
>>> timeit('f(l)', 'from __main__ import testlist as l, str_twovalue_concat as f', number=10000)
2.3599119186401367

The moral of this story is that you should not be using string concatenation in the first place. The proper way to build a new string from loads of other strings is to use a list, then use str.join():

table_rows = []
for something in something_else:
    table_rows += ['n', GetRow()]
sTable = ''.join(table_rows)

This is faster still:

>>> def str_join_concat(lst):
...     res = ''.join(['n%s' % elem for elem in lst])
... 
>>> timeit('f(l)', 'from __main__ import testlist as l, str_join_concat as f', number=10000)
1.7978830337524414

but you cannot beat using just 'n'.join(lst):

>>> timeit('f(l)', 'from __main__ import testlist as l, nl_join_concat as f', number=10000)
0.23735499382019043
Answered By: Martijn Pieters