time complexity when working with python string
Question:
I’m analysing the complexity of string concatination for different methods and what I notice.
%%time
sent = ""
word = "hello "
num = 100000
for _ in range(num):
sent += word
this method takes 20ms
%%time
sent_1 = "".join([word] * num)
the second method takes 4 ms and
%%time
sent_2 = ""
for _ in range(num//5):
sent_2 = sent_2 + word + word + word + word + word
the third method takes 208 ms
sent==sent_1==sent_2
gives me True
can someone explain why I get such results
Answers:
sent2 + word + word + word + word + word
creates multiple temporary str
objects, roughly equivalent to
t1 = sent2 + word
t2 = t1 + word
t3 = t2 + word
t4 = t3 + word
sent_2= t4 + word
Creating all these objects (and copying the ever-growing values of one temporary value into another) results in a quadratic runtime.
''.join([word] * num)
, on the other hand, gets all the strings to concatenate at once in a single list, so it can allocate one result string and copy the substring in place in linear time.
The first example,
for _ in range(num):
sent += word
should be equivalent to repeated concatenation. But, CPython in particular optimizes this by recognizing that nobody can access the value of sent
while sent + word
is being computed, so it "cheats" and updates sent
in place, with no new string being allocated to assign back to sent
. This means you still have a linear-time operation, like ''.join
, but with extra overhead due to iterating over the string in Python rather than at the C level of the interpreter.
I’m analysing the complexity of string concatination for different methods and what I notice.
%%time
sent = ""
word = "hello "
num = 100000
for _ in range(num):
sent += word
this method takes 20ms
%%time
sent_1 = "".join([word] * num)
the second method takes 4 ms and
%%time
sent_2 = ""
for _ in range(num//5):
sent_2 = sent_2 + word + word + word + word + word
the third method takes 208 ms
sent==sent_1==sent_2
gives me True
can someone explain why I get such results
sent2 + word + word + word + word + word
creates multiple temporary str
objects, roughly equivalent to
t1 = sent2 + word
t2 = t1 + word
t3 = t2 + word
t4 = t3 + word
sent_2= t4 + word
Creating all these objects (and copying the ever-growing values of one temporary value into another) results in a quadratic runtime.
''.join([word] * num)
, on the other hand, gets all the strings to concatenate at once in a single list, so it can allocate one result string and copy the substring in place in linear time.
The first example,
for _ in range(num):
sent += word
should be equivalent to repeated concatenation. But, CPython in particular optimizes this by recognizing that nobody can access the value of sent
while sent + word
is being computed, so it "cheats" and updates sent
in place, with no new string being allocated to assign back to sent
. This means you still have a linear-time operation, like ''.join
, but with extra overhead due to iterating over the string in Python rather than at the C level of the interpreter.