How does Python define whitespace?
Question:
Can someone provide a link to the sources where whitespace, as used by default in functions like strip and split, is defined?
Do the functions strip(chars) and split(sep) use the same whitespace characters when chars and sep, respectively, are omitted or None? Please give a link to the source code.
Please note that I’m aware of this description in the documentation of split:
ASCII whitespace characters (space, tab, return, newline, formfeed, vertical tab)
Answers:
"ASCII whitespace characters" means the subset of the Unicode whitespace characters that fall in the ASCII range. Unicode whitespace is defined by the Unicode Consortium; the reference can be found on their website.
Similarly, according to the documentation of strip, it defaults to whitespace characters, which, in this context, means what I have mentioned earlier. Same answer from the documentation of split.
EDIT:
Maybe I wasn’t clear enough. The last two functions, str.strip and str.split, only refer to "whitespace" in their documentation, which means Unicode whitespace, as defined in the documentation of str.isspace.
Their bytes counterparts, bytes.strip and bytes.split, indicate ASCII whitespace in their documentation.
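To make the str versus bytes difference concrete, here is a small sketch. The sample strings are made up for illustration; it only relies on the documented behaviour of str.strip and bytes.strip with no argument.

```python
# Non-ASCII whitespace: no-break space (U+00A0) and em space (U+2003).
text = "\u00a0hello\u2003"
str_result = text.strip()

# Same idea with bytes: 0xA0 is not ASCII whitespace, the trailing tab is.
data = b"\xa0hello\t"
bytes_result = data.strip()

# U+001C (file separator) is whitespace for str.strip but not for bytes.strip.
str_fs = "\x1chello".strip()
bytes_fs = b"\x1chello".strip()

print(repr(str_result))    # 'hello'
print(repr(bytes_result))  # b'\xa0hello'
print(repr(str_fs))        # 'hello'
print(repr(bytes_fs))      # b'\x1chello'
```

So the str method removes the Unicode whitespace on both ends, while the bytes method only removes the tab and leaves the 0xA0 and 0x1C bytes alone.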
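Also note that the behavior of str.split is not the same when you omit sep (or pass None, its default value) as when you pass a single space explicitly. Without sep, a run of consecutive whitespace counts as one separator; with sep=" ", every single space is a separator:
>>> "a  b".split()
['a', 'b']
>>> "a  b".split(" ")
['a', '', 'b']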
TL;DR: "whitespace" in these functions means all Unicode whitespace characters, not just ASCII whitespace.
The source code for the str class methods can be found in Objects/unicodeobject.c in the CPython source. Let’s focus only on str.strip for now:
- The function unicode_strip_impl defines the str.strip method.
- That function delegates directly to do_argstrip.
- That function checks if the "separator" argument is not None (in which case the given characters are stripped), and otherwise delegates to do_strip.
- That function has two branches, depending on the internal representation of the string:
  - If the string is encoded in ASCII, the array _Py_ascii_whitespace is used to check for whitespace characters: horizontal tab, line feed, vertical tab, form feed, carriage return (0x09 to 0x0D), and file separator, group separator, record separator, unit separator, space (0x1C to 0x20). Note that not all of those characters are included in the string.whitespace constant defined in the Python standard library.
  - Otherwise, if the string is not internally represented as ASCII, the Py_UNICODE_ISSPACE function is used, which calls _PyUnicode_IsWhitespace, an API function whose code is automatically generated by Tools/unicode/makeunicodedata.py. The data used to generate this function comes from the list spaces, which is populated with all Unicode characters which are either in the Unicode category Zs (meaning "space separator") or whose "bidirectional class" is WS, B or S (meaning "white space", "paragraph separator" and "segment separator" respectively).
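The remark about string.whitespace can be checked from Python itself. This is just a sketch that asks str.isspace about every ASCII code point and compares the result with the string.whitespace constant; the variable names are mine.

```python
import string

# ASCII code points that str.isspace treats as whitespace.
ascii_space_codes = [c for c in range(128) if chr(c).isspace()]

# Characters that str.isspace accepts but string.whitespace does not list.
only_in_isspace = sorted(
    {chr(c) for c in ascii_space_codes} - set(string.whitespace)
)

print([hex(c) for c in ascii_space_codes])
# ['0x9', '0xa', '0xb', '0xc', '0xd', '0x1c', '0x1d', '0x1e', '0x1f', '0x20']
print([hex(ord(ch)) for ch in only_in_isspace])
# ['0x1c', '0x1d', '0x1e', '0x1f']
```

The four separator control characters (0x1C to 0x1F) are the ones missing from string.whitespace.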
So the branch for Unicode-encoded strings definitely covers plenty of non-ASCII whitespace characters, such as U+00A0 (non-breaking space), U+2003 (em space) and U+2009 (thin space).
On the other hand, the branch for ASCII-encoded strings seems to just be an optimisation for strings which definitely don’t contain any of the non-ASCII Unicode characters, and it should return the same result as if the other branch had been taken.
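As a rough cross-check of the rule described above (category Zs, or bidirectional class WS, B or S), one can compare it with str.isspace over every code point using the unicodedata module. This is only a sketch: the helper name matches_rule is hypothetical, and it assumes unicodedata and the str methods were built from the same Unicode database, which is the case for a normal CPython build.

```python
import sys
import unicodedata


def matches_rule(ch):
    """Return True if ch is in category Zs or has bidi class WS, B or S."""
    return (
        unicodedata.category(ch) == "Zs"
        or unicodedata.bidirectional(ch) in ("WS", "B", "S")
    )


# Code points where str.isspace and the rule disagree (expected: none).
mismatches = [
    hex(cp)
    for cp in range(sys.maxunicode + 1)
    if chr(cp).isspace() != matches_rule(chr(cp))
]

print(mismatches)
print(matches_rule("\u00a0"), matches_rule("\u2003"), matches_rule("\u2009"))
```

If the two sources agree, the first list is empty for ASCII and non-ASCII code points alike, which fits the idea that the ASCII branch is only an optimisation.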
The str.split method apparently works likewise: it’s implemented by the unicode_split_impl function, which delegates to split, which in turn delegates to one of several functions depending on the internal encoding. The last one is ucs4lib_split_whitespace, the implementation of which is generated using STRINGLIB_ISSPACE, which is defined (for ucs4lib) as an alias for Py_UNICODE_ISSPACE.
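A quick way to see the same thing for str.split, without reading the C code, is to split a string that mixes several kinds of whitespace. The sample string is made up for illustration.

```python
# Em space (U+2003), no-break space (U+00A0) and file separator (U+001C)
# all act as separators when sep is omitted.
words = "a\u2003b\u00a0c\x1cd".split()

# With an explicit sep, only that exact string separates.
parts = "a\u2003b c".split(" ")

print(words)  # ['a', 'b', 'c', 'd']
print(parts)  # ['a\u2003b', 'c']
```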