How does Python define Whitespace

Question:

Can someone provide a link to the source where whitespace, as used by default in functions like strip and split, is defined?

Do strip(chars) and split(sep) use the same set of whitespace characters when chars or sep, respectively, is omitted or None? Please give a link to the source code.

Please note that I’m aware of this description in the documentation of split:

ASCII whitespace characters (space, tab, return, newline, formfeed, vertical tab)

Asked By: wolfrevo


Answers:

"ASCII whitespace characters" means the subset of the Unicode whitespace characters that falls within the ASCII range. The Unicode whitespace characters are defined by the Unicode Consortium; the reference can be found on their website.

Similarly, according to the documentation of strip, it defaults to whitespace characters, which, in this context, means what I have mentioned earlier. The documentation of split gives the same answer.

EDIT:

Maybe I wasn’t clear enough. The last two functions, str.strip and str.split, only refer to "whitespace" in their documentation, which means Unicode whitespace as defined in the documentation of str.isspace.

Their bytes counterparts, bytes.strip and bytes.split, specify ASCII whitespace in their documentation.
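
To make the str/bytes difference concrete, here is a small sketch. The sample strings are made up for illustration; the behaviour shown is what I would expect from CPython 3, not something quoted from the documentation.

```python
# str: the default strip/split whitespace follows the Unicode definition.
str_stripped = "\u00a0hello\u2003".strip()      # NO-BREAK SPACE, EM SPACE
str_separators = "\x1chello\x1f".strip()        # ASCII 0x1C and 0x1F

# bytes: only space, tab, newline, return, formfeed and vertical tab count.
bytes_stripped = b"\xa0hello \t".strip()        # 0xA0 is kept, " \t" is removed
bytes_separators = b"\x1chello\x1f".strip()     # 0x1C and 0x1F are kept

print(repr(str_stripped), repr(str_separators))
print(bytes_stripped, bytes_separators)
```

So the same byte values can be whitespace for str but not for bytes.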

Also note that the behavior of str.split is not the same if you don’t provide sep, and if you feed it the default value of sep:

>>> "a  b".split()
['a', 'b']
>>> "a  b".split(" ")
['a', '', 'b']
Answered By: jthulhu

TL;DR: "whitespace" in these functions means all Unicode whitespace characters, not just ASCII whitespace.


The source code for the str class methods can be found in Objects/unicodeobject.c in the CPython source. Let’s focus only on str.strip for now:

  • The function unicode_strip_impl defines the str.strip method.
  • That function delegates directly to do_argstrip.
  • That function checks whether the "separator" argument (the chars passed to strip) is None; if it is, it delegates to do_strip.
  • That function has two branches, depending on the internal representation of the string:
    • If the string is encoded in ASCII, the array _Py_ascii_whitespace is used to check for whitespace characters:

      Horizontal tab, line feed, vertical tab, form feed, carriage return (0x09 to 0x0D),

      File separator, group separator, record separator, unit separator, space (0x1C to 0x20).

      Note that not all of those characters are included in the string.whitespace constant defined in the Python standard library.

    • Otherwise, if the string is not internally represented as ASCII, the Py_UNICODE_ISSPACE function is used. It calls _PyUnicode_IsWhitespace, an API function whose code is automatically generated by Tools/unicode/makeunicodedata.py.

      The data used to generate this function comes from the list spaces, which that script populates with all Unicode characters which are either in the Unicode category Zs (meaning "space separator") or whose "bidirectional class" is WS, B or S (meaning "white space", "paragraph separator" and "segment separator" respectively).
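
The rule described in the list above can be checked from Python itself. This is only a sketch: matches_rule is a helper name I made up, and it assumes the unicodedata module is built from the same Unicode database as str.isspace, which is the case in CPython as far as I can tell.

```python
import sys
import unicodedata

def matches_rule(ch):
    """Category Zs, or bidirectional class WS, B or S (hypothetical helper)."""
    return (unicodedata.category(ch) == "Zs"
            or unicodedata.bidirectional(ch) in ("WS", "B", "S"))

# Collect whitespace code points in two ways: via str.isspace and via the rule.
by_isspace = {chr(cp) for cp in range(sys.maxunicode + 1) if chr(cp).isspace()}
by_rule = {chr(cp) for cp in range(sys.maxunicode + 1) if matches_rule(chr(cp))}

print("same set:", by_isspace == by_rule)
for ch in sorted(by_isspace):
    # Control characters have no Unicode name, hence the fallback text.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<control>"))
```

The exact list depends on the Unicode version your Python was built with, so it is safer to generate it like this than to hard-code it.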

So the branch for Unicode-encoded strings definitely covers plenty of non-ASCII whitespace characters, such as U+00A0 (non-breaking space), U+2003 (em space) and U+2009 (thin space).

On the other hand, the branch for ASCII-encoded strings seems to just be an optimisation for strings which definitely don’t contain any of the non-ASCII Unicode characters, and it should return the same result as if the other branch had been taken.
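
A quick way to convince yourself of this, and of the string.whitespace remark above, is the following sketch. The variable names are mine, and the comment about which branch is taken is an assumption based on reading the source, not something the interpreter reports.

```python
import string

# The ten ASCII code points listed above (0x09-0x0D and 0x1C-0x20).
ascii_ws = "\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f "

# Characters that str.strip() removes by default but string.whitespace lacks.
extra = sorted(set(ascii_ws) - set(string.whitespace))
print(extra)

# An ASCII-only string, and one containing 'é' (U+00E9), which is not
# ASCII-only and so presumably goes through the other branch.
ascii_result = (ascii_ws + "abc" + ascii_ws).strip()
non_ascii_result = (ascii_ws + "ab\u00e9" + ascii_ws).strip()
print(repr(ascii_result), repr(non_ascii_result))
```

Both strings lose exactly the same leading and trailing characters, which is what you would expect if the ASCII branch is purely an optimisation.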


The str.split method apparently works the same way: it’s implemented by the unicode_split_impl function, which delegates to split, which in turn delegates to one of several functions depending on the internal encoding. The last of these is ucs4lib_split_whitespace, whose implementation is generated using STRINGLIB_ISSPACE, which is defined (for ucs4lib) as an alias for Py_UNICODE_ISSPACE.
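
To answer the "same whitespace for both?" part empirically, here is a minimal sketch that runs a handful of whitespace characters through both methods. The sample list is my own selection and is not exhaustive.

```python
# A few whitespace characters, ASCII and non-ASCII (hand-picked sample).
samples = ["\t", "\x1c", "\x1f", " ", "\u0085", "\u00a0",
           "\u2003", "\u2009", "\u2028", "\u3000"]

results = {}
for ws in samples:
    results[ws] = (
        ws.isspace(),                        # str.isspace agrees it is whitespace
        f"a{ws}b".split() == ["a", "b"],     # split() with no sep splits on it
        f"{ws}a{ws}".strip() == "a",         # strip() with no chars removes it
    )

for ws, flags in results.items():
    print(f"U+{ord(ws):04X}", flags)
```

Every sampled character is treated as whitespace by isspace, split and strip alike.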

Answered By: kaya3