Python: split a string on at least 2 whitespaces
Question:
I would like to split a string only where there are two or more whitespace characters.
For example:
str = '10DEUTSCH        GGS Neue Heide 25-27     Wahn-Heide   -1      -1'
print(str.split())
Results:
['10DEUTSCH', 'GGS', 'Neue', 'Heide', '25-27', 'Wahn-Heide', '-1', '-1']
I would like it to look like this:
['10DEUTSCH', 'GGS Neue Heide 25-27', 'Wahn-Heide', '-1', '-1']
Answers:
>>> import re
>>> text = '10DEUTSCH        GGS Neue Heide 25-27     Wahn-Heide   -1      -1'
>>> re.split(r'\s{2,}', text)
['10DEUTSCH', 'GGS Neue Heide 25-27', 'Wahn-Heide', '-1', '-1']
Where
\s matches any whitespace character, like \t\n\r\f\v and more
{2,} is a repetition, meaning "2 or more"
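A minimal sketch of how the pattern behaves, assuming a made-up sample string (not from the question) that mixes spaces and tabs. A lone space stays inside its field, while any run of two or more whitespace characters acts as a separator:

```python
import re

# Hypothetical sample: single spaces inside fields, runs of 2+ whitespace between them.
sample = 'a b' + ' ' * 2 + 'c' + '\t\t' + 'd' + ' \t ' + 'e f'

# \s matches space, tab, newline, etc.; {2,} requires at least two in a row.
fields = re.split(r'\s{2,}', sample)
print(fields)  # ['a b', 'c', 'd', 'e f']
```

The raw-string prefix (r'...') keeps the backslash intact so that the regex engine, not the Python string parser, interprets \s.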
As has been pointed out, str is not a good name for your string (it shadows the built-in type), so using words instead:
output = [s.strip() for s in words.split('  ') if s]
The .split('  '), with two spaces, will give you a list that includes empty strings and items with leading/trailing whitespace. The list comprehension iterates through that list, keeps any non-empty items (if s), and .strip() takes care of any leading/trailing whitespace.
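To make the intermediate list visible, here is a small sketch with a made-up input (the gap widths of 4 and 3 spaces are assumptions chosen to produce both an empty string and a padded item):

```python
# Hypothetical input: 'a', 4 spaces, 'b', 3 spaces, 'c'.
words = 'a' + ' ' * 4 + 'b' + ' ' * 3 + 'c'

# Splitting on exactly two spaces leaves '' for each extra pair
# and a leading space when a gap has an odd width.
raw = words.split('  ')
print(raw)     # ['a', '', 'b', ' c']

# Drop the empty strings, then trim the leftover padding.
output = [s.strip() for s in raw if s]
print(output)  # ['a', 'b', 'c']
```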
In [30]: strs='10DEUTSCH        GGS Neue Heide 25-27     Wahn-Heide   -1      -1'
In [38]: filter(None, strs.split("  "))
Out[38]: ['10DEUTSCH', 'GGS Neue Heide 25-27', ' Wahn-Heide', ' -1', '-1']
In [32]: map(str.strip, filter(None, strs.split("  ")))
Out[32]: ['10DEUTSCH', 'GGS Neue Heide 25-27', 'Wahn-Heide', '-1', '-1']
For Python 3, wrap the results of filter and map with list to force iteration.
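A sketch of the same session under Python 3, where filter and map return lazy iterators. The gap widths used to build the string (8, 5, 3 and 6 spaces) are an assumption, picked so that the intermediate list shows the same leftover leading spaces as the output above:

```python
# Assumed gap widths; any even/odd/odd/even combination gives the same pieces.
strs = ('10DEUTSCH' + ' ' * 8 + 'GGS Neue Heide 25-27' + ' ' * 5
        + 'Wahn-Heide' + ' ' * 3 + '-1' + ' ' * 6 + '-1')

# filter(None, ...) drops falsy items, i.e. the empty strings.
pieces = list(filter(None, strs.split('  ')))
print(pieces)   # ['10DEUTSCH', 'GGS Neue Heide 25-27', ' Wahn-Heide', ' -1', '-1']

# map(str.strip, ...) trims the padding left by odd-width gaps.
cleaned = list(map(str.strip, pieces))
print(cleaned)  # ['10DEUTSCH', 'GGS Neue Heide 25-27', 'Wahn-Heide', '-1', '-1']
```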
In the case of:
- mixed tabs and spaces
- blanks at start and/or at end of the string
(originally an answer to "Split string at whitespace longer than a single space and tab characters, Python")
I would split with a regular expression on 2 or more whitespace characters, then filter out the empty strings that re.split yields:
import re
s = '        1. 1. 2.     1 \tNote#EvE\t \t1\t \tE3\t \t  64\t        1. 3. 2. 120 \n'
result = [x for x in re.split(r"\s{2,}", s) if x]
print(result)
prints:
['1. 1. 2.', '1', 'Note#EvE', '1', 'E3', '64', '1. 3. 2. 120']
This isn’t going to preserve leading/trailing spaces, but it’s close.
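A short sketch of why the filtering step is needed, using a made-up input: a run of two or more whitespace characters at either end of the string makes re.split return an empty string there. Stripping the input first is one possible alternative to filtering (a variation, not part of the answer above):

```python
import re

# Hypothetical input with a whitespace run at both ends.
s = '  a b \t c  \n'

# Runs at the start and end produce '' on either side.
print(re.split(r'\s{2,}', s))                   # ['', 'a b', 'c', '']

# Variant 1: filter out the empty strings afterwards.
filtered = [x for x in re.split(r'\s{2,}', s) if x]

# Variant 2: strip first, so no run touches either end.
stripped = re.split(r'\s{2,}', s.strip())

print(filtered, stripped)                       # ['a b', 'c'] ['a b', 'c']

# The two differ on whitespace-only input: filtering gives [], stripping gives [''].
print([x for x in re.split(r'\s{2,}', '   ') if x], re.split(r'\s{2,}', '   '.strip()))
```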
There’s a slight flaw in the list-comprehension-based solution given earlier. If there are trailing spaces in the input, the split could produce a last element which consists of a single space (or some number of spaces less than n, where n is the minimum number of spaces to split on), which Python considers true in Boolean contexts. That element passes the if t test, and .strip() then reduces it to an unwanted empty string at the end of the output:
>>> s = '10DEUTSCH        GGS Neue Heide 25-27     Wahn-Heide   -1      -1   '
>>> [t.strip() for t in s.split('  ') if t]
['10DEUTSCH', 'GGS Neue Heide 25-27', 'Wahn-Heide', '-1', '-1', '']
There are several ways to fix this. One is to strip each element returned by the split before checking its truthiness:
>>> s = '10DEUTSCH        GGS Neue Heide 25-27     Wahn-Heide   -1      -1   '
>>> [t.strip() for t in s.split('  ') if t.strip()]
['10DEUTSCH', 'GGS Neue Heide 25-27', 'Wahn-Heide', '-1', '-1']
But it seems a bit ugly to strip each token twice. So another way is to strip the input just once at the beginning:
>>> s = '10DEUTSCH        GGS Neue Heide 25-27     Wahn-Heide   -1      -1   '
>>> [t.strip() for t in s.strip().split('  ') if t]
['10DEUTSCH', 'GGS Neue Heide 25-27', 'Wahn-Heide', '-1', '-1']
That should be good enough if you want to go with a list comprehension. But if you are unhealthily obsessed with preciseness, maybe you will notice that because splitting happens left-to-right, each of the tokens resulting from the split can only have leading spaces, and the unwanted empty string can only happen at the end of the final output. Thus, if it is worth the extra two characters to you, you could go with
>>> s = '10DEUTSCH        GGS Neue Heide 25-27     Wahn-Heide   -1      -1   '
>>> [t.lstrip() for t in s.rstrip().split('  ') if t]
['10DEUTSCH', 'GGS Neue Heide 25-27', 'Wahn-Heide', '-1', '-1']
If you are using Python 3.8+, you can use the walrus operator to avoid redundant stripping:
>>> s = '10DEUTSCH        GGS Neue Heide 25-27     Wahn-Heide   -1      -1   '
>>> [w for t in s.split('  ') if (w := t.strip())]
['10DEUTSCH', 'GGS Neue Heide 25-27', 'Wahn-Heide', '-1', '-1']
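As a sanity check, the variants discussed in this thread can be run side by side on one input. This is only a sketch: the gap widths and the three trailing spaces are assumptions (an odd trailing run is what triggers the stray empty string), and the walrus variant needs Python 3.8+.

```python
import re

# Assumed spacing: gaps of 8, 5, 3 and 6 spaces, plus 3 trailing spaces.
s = ('10DEUTSCH' + ' ' * 8 + 'GGS Neue Heide 25-27' + ' ' * 5
     + 'Wahn-Heide' + ' ' * 3 + '-1' + ' ' * 6 + '-1' + ' ' * 3)

expected = ['10DEUTSCH', 'GGS Neue Heide 25-27', 'Wahn-Heide', '-1', '-1']

variants = {
    'strip twice':   [t.strip() for t in s.split('  ') if t.strip()],
    'strip input':   [t.strip() for t in s.strip().split('  ') if t],
    'rstrip/lstrip': [t.lstrip() for t in s.rstrip().split('  ') if t],
    'walrus':        [w for t in s.split('  ') if (w := t.strip())],
    'regex':         [x for x in re.split(r'\s{2,}', s) if x],
}

# The unfixed comprehension keeps the leftover single space and strips it to ''.
flawed = [t.strip() for t in s.split('  ') if t]

for name, result in variants.items():
    print(name, result == expected)  # each line ends with True
print('flawed', flawed[-1] == '')    # flawed True
```

Note that only the regex variant also treats tabs and newlines as separators; the str.split variants split on literal pairs of spaces.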