How can I split at word boundaries with regexes?

Question:

I’m trying to do this:

import re
sentence = "How are you?"
print(re.split(r'\b', sentence))

The result is:

[u'How are you?']

I want something like [u'How', u'are', u'you', u'?']. How can this be achieved?

Asked By: oarfish

Answers:

Unfortunately, Python versions before 3.7 cannot split on empty matches.

To get around this, you would need to use findall instead of split.

Actually, \b just means word boundary.

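It is roughly equivalent to (?<=\w)(?=\W)|(?<=\W)(?=\w), except that \b also matches at the very start or end of the string when a word character sits there.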

That means, the following code would work:

import re
sentence = "How are you?"
print(re.findall(r'\w+|\W+', sentence))
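
For reference, here is a short sketch of what these patterns produce on the sentence from the question. The final re.split call assumes Python 3.7 or later, where splitting on empty matches is supported; the whitespace filter is an extra step added for illustration, not part of the answer above.

```python
import re

sentence = "How are you?"

# findall keeps the whitespace runs as separate tokens
tokens = re.findall(r'\w+|\W+', sentence)
print(tokens)  # ['How', ' ', 'are', ' ', 'you', '?']

# Dropping whitespace-only tokens gives the list asked for in the question
words = [t for t in tokens if not t.isspace()]
print(words)  # ['How', 'are', 'you', '?']

# Positions where \b matches (finditer reports empty matches)
boundaries = [m.start() for m in re.finditer(r'\b', sentence)]
print(boundaries)  # [0, 3, 4, 7, 8, 11]

# Assumption: Python 3.7+, where re.split can split on empty matches
parts = re.split(r'\b', sentence)
print(parts)  # ['', 'How', ' ', 'are', ' ', 'you', '?']
```

Note that even on 3.7+ the split result still contains the whitespace tokens and a leading empty string, so findall with a filter is arguably the simpler route.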
Answered By: Kenny Lau

import re
split = re.findall(r"[\w']+|[.,!?;]", "How are you?")
print(split)

Output:

['How', 'are', 'you', '?']

Regex Explanation:

"[\w']+|[.,!?;]"

    1st Alternative: [\w']+
        [\w']+ match a single character present in the list below
            Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
            \w match any word character [a-zA-Z0-9_]
            ' the literal character '
    2nd Alternative: [.,!?;]
        [.,!?;] match a single character present in the list below
            .,!?; a single character in the list .,!?; literally
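
To see how the two alternatives interact, here is a small sketch on inputs with a contraction and with characters that neither alternative covers. The sample strings are made up for illustration; the pattern is the one explained above.

```python
import re

pattern = r"[\w']+|[.,!?;]"

# Hypothetical input: the apostrophe stays inside the word token
sample = "Don't stop, please!"
tokens = re.findall(pattern, sample)
print(tokens)  # ["Don't", 'stop', ',', 'please', '!']

# Characters outside both classes (here '-' and ':') are silently dropped
dropped = re.findall(pattern, "well-known: yes")
print(dropped)  # ['well', 'known', 'yes']
```

If other punctuation should be kept, it would have to be added to the second character class.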
Answered By: Pedro Lobito

Here is my approach to split on word boundaries:

re.split(r"\b\W\b", "How are you?") # Reprocess list to split on special characters.
# Result: ['How', 'are', 'you?']

And using findall on word boundaries:

re.findall(r"\b\w+\b", "How are you?")
# Result: ['How', 'are', 'you']
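
The reprocessing step mentioned in the comment on the split call could look like the sketch below. The second pattern, \w+|[^\w\s]+, is an assumption about what "special characters" means here and is not part of the original answer.

```python
import re

sentence = "How are you?"

# First pass: split on a single non-word character sitting between two words
chunks = re.split(r"\b\W\b", sentence)
print(chunks)  # ['How', 'are', 'you?']

# Second pass (assumed): separate word runs from punctuation runs in each chunk
tokens = [t for chunk in chunks for t in re.findall(r"\w+|[^\w\s]+", chunk)]
print(tokens)  # ['How', 'are', 'you', '?']
```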
Answered By: Vishal Kumar Sahu