Python string comparisons time complexity
Question:
I’m curious how Python performs string comparisons under the hood.
For example, is
    if s1 == s2:
        print(True)
    else:
        print(False)
the same as
    condition = True
    for x, y in zip(s1, s2):
        if x != y:
            condition = False
    print(condition)
Or perhaps, under the hood, Python is able to use ord values to do better than an O(n) traversal?
Answers:
A simple test:
    s1 = "a"
    s2 = "aa"
    condition = True
    for x, y in zip(s1, s2):
        if x != y:
            condition = False
    print(condition)  # True
shows that your assumption is incorrect: the loop reports True even though the two strings differ.
Otherwise, Python's == is very efficient, so you can assume it's at worst O(n).
Regardless of how it’s implemented, the comparison of two strings is going to take O(n) time. (There might exist pre-built side data structures that could help speed it up, but I’m assuming your input is just two strings and nothing else.)
Yes, the C implementation that == ends up calling is much faster, because it's in C rather than a Python loop, but its worst-case big-O complexity is still going to be O(n).
PS: as @AdvMaple pointed out, your alternative implementation is wrong, because zip stops as soon as one of its inputs runs out of elements, but that does not change the time-complexity question.
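To make the zip point concrete, here is a minimal sketch. The function names zip_equal_naive and zip_equal_padded are made up for illustration, and using itertools.zip_longest is just one possible way to make a length mismatch visible, not something the question or the answers above used.

```python
from itertools import zip_longest

def zip_equal_naive(s1, s2):
    """The loop from the question: only the overlapping prefix is compared."""
    condition = True
    for x, y in zip(s1, s2):
        if x != y:
            condition = False
    return condition

def zip_equal_padded(s1, s2):
    """Same idea, but zip_longest pads the shorter string with None,
    so a length mismatch shows up as a differing pair."""
    for x, y in zip_longest(s1, s2):
        if x != y:
            return False
    return True

print(list(zip("a", "aa")))         # [('a', 'a')] : the second 'a' is never looked at
print(zip_equal_naive("a", "aa"))   # True (wrong)
print(zip_equal_padded("a", "aa"))  # False
```

Both versions still walk the strings one character at a time, so neither changes the O(n) answer.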
Python's string compare is implemented in unicodeobject.c. After a few checks such as string length and "kind" (Python may use 1, 2 or 4 bytes per character depending on the Unicode UCS character size), it's just a call to the C library function memcmp.
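If you want to see the "kind" for yourself, the following is a rough sketch that assumes CPython 3.3 or later. The helper name bytes_per_char is hypothetical; it relies on sys.getsizeof growing by exactly one character's storage when a string gets one character longer, which is an implementation detail rather than a language guarantee.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate per-character storage by growing a string by one character."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

print(bytes_per_char("a"))           # ASCII character
print(bytes_per_char("\u20ac"))      # euro sign, outside Latin-1
print(bytes_per_char("\U0001F600"))  # emoji, outside the Basic Multilingual Plane
```

On a typical CPython build these three calls print 1, 2 and 4. Two equal strings necessarily have the same length and the same kind, which is why a raw byte comparison is enough once those checks have passed.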
With a quick change to your Python code
    condition = True
    if len(s1) != len(s2):
        condition = False
    else:
        for x, y in zip(s1, s2):
            if x != y:
                condition = False
                break
the Python code has the same O(n) time complexity as memcmp; it's just that Python has a much bigger constant factor. Time complexity doesn't say anything about how long an operation takes, just how it scales as the input size n grows.
memcmp is much faster than the Python version because of inherent language overhead, but it scales the same. And when you think about it, each of the if x != y: compares in the second example runs the exact same code as the single s1 == s2 compare in the first.
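A quick way to check the scaling claim on your own machine is a timeit sketch along these lines. The function name loop_equal, the string sizes and the repeat count are arbitrary choices for illustration, and the absolute numbers will vary from machine to machine.

```python
import timeit

def loop_equal(s1, s2):
    """Length check first, then a character-by-character loop with early exit."""
    if len(s1) != len(s2):
        return False
    for x, y in zip(s1, s2):
        if x != y:
            return False
    return True

for n in (1_000, 10_000, 100_000):
    # Two equal strings built separately: the worst case, since every
    # character has to be examined before the answer is known.
    s1 = "a" * n
    s2 = "a" * (n - 1) + "a"
    t_builtin = timeit.timeit(lambda: s1 == s2, number=20)
    t_loop = timeit.timeit(lambda: loop_equal(s1, s2), number=20)
    print(f"n={n:>7}  ==: {t_builtin:.6f}s  loop: {t_loop:.6f}s")
```

The loop column should grow roughly tenfold with each step in n. The == column is expected to grow as well, though at small sizes the memcmp call is so fast that timer overhead can hide the trend. What differs between the two is the constant factor, not the shape of the curve.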