Is there official documentation for Python's len() function warning that it can sometimes return apparently wrong values for some strings?
Question:
$ python
Python 3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> for name in ["Blue Oyster Cult", "Blue Öyster Cult", "Spinal Tap", "Spın̈al Tap"]:
...     print(f'{len(name):3d} {name}')
...
 16 Blue Oyster Cult
 16 Blue Öyster Cult
 10 Spinal Tap
 11 Spın̈al Tap
>>> quit()
I’m not asking for an explanation of this behaviour; I’m asking for any official documentation for the len() function itself saying that it will return a seemingly wrong answer for the last case.
Answers:
The length is not wrong: the n̈ is simply stored as two code points, with the numerical values 110 and 776.
You can convert a string to its numerical representation like this:
>>> [ord(c) for c in "Spın̈al Tap"]
[83, 112, 305, 110, 776, 97, 108, 32, 84, 97, 112]
If you convert it to a list of characters like this:
>>> [c for c in "Spın̈al Tap"]
['S', 'p', 'ı', 'n', '̈', 'a', 'l', ' ', 'T', 'a', 'p']
As you can see, the n̈ glyph consists of the characters n and '̈'.
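To make the two code points easier to identify, here is a small sketch using the standard unicodedata module to print each code point's number and official Unicode name. The string is written with escape sequences (an illustrative choice, not part of the original post) so the combining mark cannot be lost when copying.

```python
import unicodedata

# Same text as in the question, written with escapes:
# \u0131 is the dotless i, \u0308 is the combining diaeresis.
name = "Sp\u0131n\u0308al Tap"

# Print each code point in hex, in decimal, and by its Unicode name.
for ch in name:
    print(f"U+{ord(ch):04X} {ord(ch):4d} {unicodedata.name(ch)}")

# Collect the names so they can be inspected programmatically.
code_point_names = [unicodedata.name(ch) for ch in name]
```

The fifth entry should be COMBINING DIAERESIS, which is the extra code point that len() counts.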
len() returns the length of a sequence:
https://docs.python.org/3/library/functions.html#len
A string is defined as a sequence of Unicode code points:
https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str
So once you know what a Unicode code point is, and how many of them this particular string consists of, you can work out what len() returns.
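As a rough illustration of "sequence of code points", the sketch below writes the same visible text in two ways. It assumes the "Blue Öyster Cult" string in the question used the single precomposed code point U+00D6, which is consistent with its reported length of 16.

```python
# Ö as one precomposed code point, U+00D6
precomposed = "Blue \u00d6yster Cult"
# O followed by the combining diaeresis U+0308
decomposed = "Blue O\u0308yster Cult"

print(len(precomposed))           # 16
print(len(decomposed))            # 17
print(precomposed == decomposed)  # False: different code point sequences
```

Both strings usually render identically, but len() counts the code points actually stored.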
The documentation does not explicitly warn you that a string can have a different number of Unicode code points than what you expect.
However, there is a "Unicode HOWTO" that explains a bit more about code points:
https://docs.python.org/3/howto/unicode.html
There are two things that need to be distinguished. The first is the number of characters (Unicode code points) in a string. This is given by len().
s = "n̈"
print(len(s))
# 2
This makes sense because n̈ is U+006E U+0308: the letter n followed by a combining diaeresis.
The second is what is referred to as user-perceived letters (in technical parlance, extended grapheme clusters).
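# regex is a third-party module (pip install regex); the stdlib re module does not support \X
import regex as re

def graphemes(text):
    # \X matches one extended grapheme cluster
    return re.findall(r'\X', text)

print(len(graphemes(s)))
# 1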
So n̈ is two characters, but one grapheme.
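If installing a third-party module is not an option, a rough standard-library approximation is to count only the code points that are not combining marks. This is a simplified sketch, not real extended grapheme cluster segmentation: the helper name approx_grapheme_count is hypothetical, and the approach will miscount sequences such as emoji joined with zero-width joiners.

```python
import unicodedata

def approx_grapheme_count(text):
    """Count code points whose canonical combining class is 0.

    Rough approximation only: combining marks such as U+0308 are
    skipped, but other cluster rules (ZWJ sequences, Hangul jamo,
    regional indicators, ...) are NOT handled.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

s = "n\u0308"                      # n + combining diaeresis
print(len(s))                      # 2
print(approx_grapheme_count(s))    # 1

name = "Sp\u0131n\u0308al Tap"
print(len(name))                   # 11
print(approx_grapheme_count(name)) # 10
```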
Emojis can be more complex:
e = " "
len(e) # 5
print(len(graphemes(e))) # 1
One emoji created by five Unicode characters.
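To make a five-code-point emoji concrete, here is a sketch that assumes a family emoji built from MAN, WOMAN and BOY joined by zero-width joiners. The specific emoji is an assumption chosen for illustration and is written with escapes so each code point is visible.

```python
import unicodedata

# Assumed example: MAN + ZWJ + WOMAN + ZWJ + BOY
family = "\U0001F468\u200D\U0001F469\u200D\U0001F466"

print(len(family))  # 5

# List the individual code points that make up the single emoji.
for ch in family:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Fonts with emoji support typically draw this as one picture, yet len() still reports five because it counts code points.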
The length of a string is fluid: text transformations can change it. Uppercasing, lowercasing, casefolding, title casing and Unicode normalisation can each change the length of a string.
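A short sketch of each of these transformations changing a length, using only the standard library. The sample characters (ß, İ, the ﬁ ligature, Ö) are illustrative choices and are written with escapes to keep them unambiguous.

```python
import unicodedata

sharp_s = "\u00df"     # ß, LATIN SMALL LETTER SHARP S
dotted_i = "\u0130"    # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
ligature = "\ufb01"    # ﬁ, LATIN SMALL LIGATURE FI
o_umlaut = "\u00d6"    # Ö, precomposed

lengths = {
    "upper":    (len(sharp_s), len(sharp_s.upper())),      # ß -> SS
    "lower":    (len(dotted_i), len(dotted_i.lower())),    # İ -> i + U+0307
    "casefold": (len(sharp_s), len(sharp_s.casefold())),   # ß -> ss
    "title":    (len(ligature), len(ligature.title())),    # ﬁ -> Fi
    "NFD":      (len(o_umlaut),
                 len(unicodedata.normalize("NFD", o_umlaut))),  # Ö -> O + U+0308
}

# Each row shows the length before and after the transformation.
for operation, (before, after) in lengths.items():
    print(f"{operation:8s} {before} -> {after}")
```

Normalisation can also shorten a string: NFC turns O followed by U+0308 into the single code point Ö. It leaves n̈ at two code points, because Unicode has no precomposed n with diaeresis.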