Is there official documentation for Python's len() function warning that it can sometimes return apparently wrong values for some strings?

Question:

$ python
Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> for name in ["Blue Oyster Cult", "Blue Öyster Cult", "Spinal Tap", "Spın̈al Tap"]:
...     print(f'{len(name):3d} {name}')
... 
 16 Blue Oyster Cult
 16 Blue Öyster Cult
 10 Spinal Tap
 11 Spın̈al Tap
>>> quit()

I’m not asking for an explanation of this behaviour, I’m asking for any official documentation for the len() function itself saying that it will return a seemingly wrong answer for the last case.

Asked By: Ray Butterworth

||

Answers:

The length is not wrong: the n̈ is just stored as two code points, specifically the numerical values 110 and 776.
You can convert a string to its numerical representation like this:

>>> [ord(c) for c in "Spın̈al Tap"]
[83, 112, 305, 110, 776, 97, 108, 32, 84, 97, 112]

If you convert it to a list of characters like this:

>>> [c for c in "Spın̈al Tap"]
['S', 'p', 'ı', 'n', '̈', 'a', 'l', ' ', 'T', 'a', 'p']

As you can see, the glyph n̈ consists of two characters: n and '̈' (a combining diaeresis).
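
To make the two code points easier to identify, the standard library's unicodedata module can report the official Unicode name of each one. This is only a small sketch; the helper name describe is made up for illustration and is not part of any library.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for each code point in text.

    `describe` is a hypothetical helper, not a standard function.
    """
    return [(f"U+{ord(c):04X}", unicodedata.name(c, "<unnamed>")) for c in text]

# "n" followed by U+0308, written with an escape so the two parts are visible
for code_point, name in describe("n\u0308"):
    print(code_point, name)
# U+006E LATIN SMALL LETTER N
# U+0308 COMBINING DIAERESIS
```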

Answered By: jsiller

len() returns the length of a sequence:

https://docs.python.org/3/library/functions.html#len

A string is defined as a sequence of Unicode code points:

https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str

So if you know what a Unicode code point is, and how many of them this particular string contains, you can predict what len() returns.

The documentation does not explicitly warn you that a string can have a different number of Unicode code points than what you expect.

However, there is a "Unicode HOWTO" that explains a bit more about code points:

https://docs.python.org/howto/unicode.html

Answered By: mkrieger1

There are two things that need to be distinguished. The first is the number of characters (Unicode code points) in a string. This is given by len().

s = "n̈"
print(len(s))
# 2

Which makes sense because n̈ is U+006E U+0308, the letter n, followed by a combining diaeresis.

The second is what is referred to as user-perceived letters (in technical parlance extended grapheme clusters).

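import regex as re  # third-party "regex" module (pip install regex), not the stdlib re
def graphemes(text):
    # \X matches one extended grapheme cluster
    return re.findall(r'\X', text)
print(len(graphemes(s)))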
# 1

So n̈ is two characters, but one grapheme.
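
If installing a third-party module is not an option, a rough approximation is possible with the standard library alone: count only the code points that are not combining marks. This is a sketch under that assumption, not real extended grapheme cluster segmentation; it will miscount emoji joined with zero-width joiners and some other scripts. The function name approx_grapheme_count is made up for illustration.

```python
import unicodedata

def approx_grapheme_count(text):
    """Count code points whose canonical combining class is 0.

    Hypothetical helper: only an approximation of user-perceived characters,
    since it skips combining marks but knows nothing about ZWJ sequences.
    """
    return sum(1 for c in text if unicodedata.combining(c) == 0)

name = "Sp\u0131n\u0308al Tap"  # the string from the question, with escapes
print(len(name))                    # 11
print(approx_grapheme_count(name))  # 10
```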

Emojis can be more complex:

e = "\U0001F468\u200D\U0001F469\u200D\U0001F466"  # e.g. a family emoji: man, ZWJ, woman, ZWJ, boy
print(len(e))              # 5
print(len(graphemes(e)))   # 1

One emoji created by five Unicode characters.

The length of a string is fluid: different text transformations can change it. Uppercasing, lowercasing, casefolding, title casing and Unicode normalisation can each change the length of a string.
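
A few concrete cases, using only built-in str methods and the standard unicodedata module, illustrate this. The particular characters are just examples chosen for this sketch; many others behave similarly.

```python
import unicodedata

sharp_s = "\u00df"      # ß, LATIN SMALL LETTER SHARP S
dotted_i = "\u0130"     # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
fi_ligature = "\ufb01"  # ﬁ, LATIN SMALL LIGATURE FI
o_umlaut = "\u00d6"     # Ö, precomposed, as in "Blue Öyster Cult"

print(len(sharp_s), len(sharp_s.upper()))          # 1 2  (becomes "SS")
print(len(sharp_s), len(sharp_s.casefold()))       # 1 2  (becomes "ss")
print(len(dotted_i), len(dotted_i.lower()))        # 1 2  (i + combining dot above)
print(len(fi_ligature), len(fi_ligature.title()))  # 1 2  (becomes "Fi")

# Normalisation: Ö has a precomposed form, n̈ does not.
print(len(unicodedata.normalize("NFD", o_umlaut)))   # 2
print(len(unicodedata.normalize("NFC", "O\u0308")))  # 1
print(len(unicodedata.normalize("NFC", "n\u0308")))  # 2
```

The last three lines also account for the output in the question: Ö exists as a single precomposed code point, whereas n̈ has no precomposed form, so it stays at two code points even after NFC normalisation.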

Answered By: Andj