python split a string with at least 2 whitespaces

Question:

I would like to split a string only where there are two or more consecutive whitespace characters.

For example

str = '10DEUTSCH        GGS Neue Heide 25-27     Wahn-Heide   -1      -1'
print(str.split())

Results:

['10DEUTSCH', 'GGS', 'Neue', 'Heide', '25-27', 'Wahn-Heide', '-1', '-1']

I would like it to look like this:

['10DEUTSCH', 'GGS Neue Heide 25-27', 'Wahn-Heide', '-1', '-1']
Asked By: Eagle

Answers:

>>> import re
>>> text = '10DEUTSCH        GGS Neue Heide 25-27     Wahn-Heide   -1      -1'
>>> re.split(r'\s{2,}', text)
['10DEUTSCH', 'GGS Neue Heide 25-27', 'Wahn-Heide', '-1', '-1']

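Where

\s matches any whitespace character, such as a space, tab, or newline
{2,} is a repetition, meaning "2 or more"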

Answered By: unutbu
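
A minimal sketch of the same regular-expression approach wrapped in a reusable function. The names GAP and split_on_gaps are hypothetical, introduced only for illustration; the pattern is written as a raw string so that the backslash in \s reaches the regex engine intact.

```python
import re

# Two or more consecutive whitespace characters (spaces, tabs, newlines, ...).
# GAP and split_on_gaps are illustrative names, not part of any library.
GAP = re.compile(r'\s{2,}')

def split_on_gaps(text):
    """Split text wherever at least two whitespace characters occur in a row."""
    return GAP.split(text)

text = '10DEUTSCH        GGS Neue Heide 25-27     Wahn-Heide   -1      -1'
print(split_on_gaps(text))
# ['10DEUTSCH', 'GGS Neue Heide 25-27', 'Wahn-Heide', '-1', '-1']

# A space followed by a tab also counts as two whitespace characters.
print(split_on_gaps('a \tb  c'))
# ['a', 'b', 'c']
```

Note that this sketch does not strip the input, so leading or trailing runs of whitespace would still produce empty strings at the edges of the result.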

As has been pointed out, str is not a good name for your string, so using words instead:

output = [s.strip() for s in words.split('  ') if s]

The .split('  '), with two spaces, gives you a list that includes empty strings and items with leading/trailing whitespace. The list comprehension iterates through that list, keeps any non-empty items (if s), and .strip() removes any leading/trailing whitespace.

Answered By: toxotes

In [30]: strs='10DEUTSCH        GGS Neue Heide 25-27     Wahn-Heide   -1      -1'

In [38]: filter(None, strs.split("  "))

Out[38]: ['10DEUTSCH', 'GGS Neue Heide 25-27', ' Wahn-Heide', ' -1', '-1']

In [32]: map(str.strip, filter(None, strs.split("  ")))

Out[32]: ['10DEUTSCH', 'GGS Neue Heide 25-27', 'Wahn-Heide', '-1', '-1']

The output above is from Python 2. In Python 3, filter and map return lazy iterators, so wrap each result in list() to get an actual list.

Answered By: Ashwini Chaudhary
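
For reference, a sketch of how the filter/map approach above might look in Python 3, where both functions return iterators rather than lists. The variable names pieces and result are chosen here for illustration.

```python
strs = '10DEUTSCH        GGS Neue Heide 25-27     Wahn-Heide   -1      -1'

# filter(None, ...) drops the empty strings produced by consecutive separators.
# In Python 3 it returns a lazy iterator, so list() is needed to materialize it.
pieces = list(filter(None, strs.split("  ")))
print(pieces)
# ['10DEUTSCH', 'GGS Neue Heide 25-27', ' Wahn-Heide', ' -1', '-1']

# Odd-length runs of spaces leave one stray leading space; str.strip removes it.
result = list(map(str.strip, pieces))
print(result)
# ['10DEUTSCH', 'GGS Neue Heide 25-27', 'Wahn-Heide', '-1', '-1']
```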

In the case of:

  • mixed tabs and spaces
  • blanks at start and/or at end of the string

(Originally written as an answer to "Split string at whitespace longer than a single space and tab characters, Python".)

I would split with a regular expression: 2 or more blanks, then filter out the empty strings that re.split yields:

import re

s = '        1. 1. 2.     1 \tNote#EvE\t \t1\t \tE3\t \t  64\t        1. 3. 2. 120 \n'

result = [x for x in re.split(r"\s{2,}", s) if x]

print(result)

prints:

['1. 1. 2.', '1', 'Note#EvE', '1', 'E3', '64', '1. 3. 2. 120']

This isn’t going to preserve leading/trailing spaces, but it’s close.
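
As a possible variation, and only a sketch, the input can be stripped before splitting so that re.split produces no empty strings at the edges and the filtering step is not needed. This assumes the input contains at least one non-whitespace character; an all-whitespace input would yield a list holding one empty string.

```python
import re

# Mixed tabs and spaces, with whitespace at both ends of the string.
s = '        1. 1. 2.     1 \tNote#EvE\t \t1\t \tE3\t \t  64\t        1. 3. 2. 120 \n'

# Strip once up front, then split on runs of two or more whitespace characters.
result = re.split(r'\s{2,}', s.strip())

print(result)
# ['1. 1. 2.', '1', 'Note#EvE', '1', 'E3', '64', '1. 3. 2. 120']
```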

There’s a slight flaw in the list-comprehension-based solution given earlier. If there are trailing spaces in the input, the split could produce a last element which consists of a single space (or some number of spaces less than n, where n is the minimum number of spaces to split on). Python treats such a string as true in Boolean contexts, so it passes the filter, and .strip() then turns it into an unwanted empty string at the end of the output:

>>> s = '10DEUTSCH        GGS Neue Heide 25-27     Wahn-Heide   -1      -1   '
>>> [t.strip() for t in s.split('  ') if t]
['10DEUTSCH', 'GGS Neue Heide 25-27', 'Wahn-Heide', '-1', '-1', '']

There are several ways to fix this. One is to strip each element returned by the split before checking its truthiness:

>>> s = '10DEUTSCH        GGS Neue Heide 25-27     Wahn-Heide   -1      -1   '
>>> [t.strip() for t in s.split('  ') if t.strip()]
['10DEUTSCH', 'GGS Neue Heide 25-27', 'Wahn-Heide', '-1', '-1']

But it seems a bit ugly to strip each token twice. So another way is to strip the input just once at the beginning:

>>> s = '10DEUTSCH        GGS Neue Heide 25-27     Wahn-Heide   -1      -1   '
>>> [t.strip() for t in s.strip().split('  ') if t]
['10DEUTSCH', 'GGS Neue Heide 25-27', 'Wahn-Heide', '-1', '-1']

That should be good enough if you want to go with a list comprehension. But if you are unhealthily obsessed with preciseness, maybe you will notice that because splitting happens left-to-right, each of the tokens resulting from the split can only have leading spaces, and the unwanted empty string can only happen at the end of the final output. Thus, if it is worth the extra two characters to you, you could go with

>>> s = '10DEUTSCH        GGS Neue Heide 25-27     Wahn-Heide   -1      -1   '
>>> [t.lstrip() for t in s.rstrip().split('  ') if t]
['10DEUTSCH', 'GGS Neue Heide 25-27', 'Wahn-Heide', '-1', '-1']

If you are using Python 3.8+, you can use the walrus operator to avoid redundant stripping:

>>> s = '10DEUTSCH        GGS Neue Heide 25-27     Wahn-Heide   -1      -1   '
>>> [w for t in s.split('  ') if (w := t.strip())]
['10DEUTSCH', 'GGS Neue Heide 25-27', 'Wahn-Heide', '-1', '-1']
Answered By: John Y
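
The last answer mentions n, the minimum number of spaces to split on. A hedged sketch of how the walrus-based comprehension might be generalized to any n follows; the function name split_on_spaces and its parameter n are hypothetical, and Python 3.8+ is assumed because of the := operator.

```python
def split_on_spaces(s, n=2):
    """Split s on runs of at least n spaces, dropping empty and blank pieces.

    Hypothetical helper for illustration; requires Python 3.8+ for ':='.
    """
    separator = ' ' * n
    # Strip each piece once, bind it to w, and keep it only if non-empty.
    return [w for t in s.split(separator) if (w := t.strip())]

s = '10DEUTSCH        GGS Neue Heide 25-27     Wahn-Heide   -1      -1   '
print(split_on_spaces(s))
# ['10DEUTSCH', 'GGS Neue Heide 25-27', 'Wahn-Heide', '-1', '-1']

# With n=3, a run of only two spaces is kept inside the token.
print(split_on_spaces('a  b   c    d', n=3))
# ['a  b', 'c', 'd']
```

Like the original comprehension, this only treats the space character as a separator; for tabs or other whitespace, the regular-expression approach is the better fit.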