How can I split at word boundaries with regexes?

Question

I’m trying to do this:

import re
sentence = "How are you?"
print(re.split(r'\b', sentence))

The result being

[u'How are you?']

I want something like [u'How', u'are', u'you', u'?']. How can this be achieved?

Asked By: oarfish


Answer 1

Unfortunately, Python versions before 3.7 cannot split on an empty match, and \b only ever matches the empty string.

To get around this, you would need to use findall instead of split.

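Note that \b simply means a word boundary.

It is roughly equivalent to (?<=\w)(?=\W)|(?<=\W)(?=\w), except that \b also matches at the start or end of the string when the adjacent character is a word character.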

That means, the following code would work:

import re
sentence = "How are you?"
print(re.findall(r'\w+|\W+', sentence))
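
As a rough sketch of what this returns: \W+ also captures the spaces, so the list needs a small clean-up step to match what the question asks for. The variable names below are only illustrative, and the last part assumes Python 3.7 or newer, where re.split accepts patterns that match the empty string.

```python
import re

sentence = "How are you?"

# \w+ grabs runs of word characters, \W+ grabs runs of everything else,
# so the spaces come back as tokens of their own.
tokens = re.findall(r"\w+|\W+", sentence)
print(tokens)  # ['How', ' ', 'are', ' ', 'you', '?']

# Strip surrounding whitespace and drop tokens that were only whitespace.
words = [t.strip() for t in tokens if t.strip()]
print(words)  # ['How', 'are', 'you', '?']

# Assumption: Python 3.7+. There, splitting on \b works directly,
# but it yields empty and whitespace-only pieces that must be filtered.
parts = [p for p in re.split(r"\b", sentence) if p.strip()]
print(parts)  # ['How', 'are', 'you', '?']
```

Consecutive punctuation such as "?!" stays together as one token with this pattern, which may or may not be what you want.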

Answered By: Kenny Lau

Answer 2

import re
split = re.findall(r"[\w']+|[.,!?;]", "How are you?")
print(split)

Output:

['How', 'are', 'you', '?']


Regex Explanation:

"[\w']+|[.,!?;]"

    1st Alternative: [\w']+
        [\w']+ match a single character present in the list below
            Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
            \w match any word character [a-zA-Z0-9_]
            ' the literal character '
    2nd Alternative: [.,!?;]
        [.,!?;] match a single character present in the list below
            .,!?; a single character in the list .,!?; literally
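
To see the two alternatives in action, here is a minimal sketch that wraps the pattern in a helper. The names TOKEN_RE and tokenize are made up for this illustration and are not part of the answer.

```python
import re

# Hypothetical helper names, used only for this example.
# 1st alternative: runs of word characters and apostrophes (keeps "Don't" whole).
# 2nd alternative: exactly one of the listed punctuation marks.
TOKEN_RE = re.compile(r"[\w']+|[.,!?;]")

def tokenize(text):
    """Return words and the listed punctuation marks as separate tokens."""
    return TOKEN_RE.findall(text)

print(tokenize("How are you?"))             # ['How', 'are', 'you', '?']
print(tokenize("Don't panic, it's fine!"))  # ["Don't", 'panic', ',', "it's", 'fine', '!']
```

Anything that fits neither alternative (spaces, but also characters like : or -) is silently skipped, so extend the second character class if you need those.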

Answered By: Pedro Lobito

Answer 3

Here is my approach to split on word boundaries:

re.split(r"\b\W\b", "How are you?") # Reprocess list to split on special characters.
# Result: ['How', 'are', 'you?']

and using findall on word boundaries

re.findall(r"\b\w+\b", "How are you?")
# Result: ['How', 'are', 'you']
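
The "reprocess" step mentioned in the first snippet could look something like the following. This is only one possible sketch: the function name is hypothetical, and the second pattern is my own choice rather than something from the answer.

```python
import re

def split_words_and_punct(text):
    """Split on single separators between words, then peel punctuation off each piece."""
    # Step 1: split wherever one non-word character sits between two word characters.
    pieces = re.split(r"\b\W\b", text)
    tokens = []
    for piece in pieces:
        # Step 2 (assumed reprocessing): separate word runs from punctuation runs,
        # ignoring whitespace.
        tokens.extend(re.findall(r"\w+|[^\w\s]+", piece))
    return tokens

print(split_words_and_punct("How are you?"))  # ['How', 'are', 'you', '?']
print(split_words_and_punct("Hi, there"))     # ['Hi', ',', 'there']
```

Be aware that the first step also splits on, and discards, a lone hyphen or apostrophe inside a word, so "well-known" becomes two tokens.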

Answered By: Vishal Kumar Sahu
