How can I split at word boundaries with regexes?
Question:
I’m trying to do this:
import re
sentence = "How are you?"
print(re.split(r'\b', sentence))
The result is:
[u'How are you?']
I want something like [u'How', u'are', u'you', u'?']. How can this be achieved?
Answers:
Unfortunately, Python (before version 3.7) cannot split on empty matches.
To get around this, you would need to use findall instead of split.
Actually, \b just means word boundary.
Away from the start and end of the string, it is equivalent to (?<=\w)(?=\W)|(?<=\W)(?=\w).
That means, the following code would work:
import re
sentence = "How are you?"
print(re.findall(r'\w+|\W+', sentence))
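Note that \W+ also matches the spaces, so they show up as separate items in the result. A minimal sketch of one way to drop them, assuming you want each punctuation mark as its own token (the [^\w\s] alternative and the variable names are my own choice, not part of the answer above). The second half relies on Python 3.7 or later, where re.split accepts patterns that can match the empty string:

```python
import re

sentence = "How are you?"

# \w+ grabs runs of word characters; [^\w\s] grabs one punctuation mark at a time.
# Whitespace matches neither alternative, so it is skipped.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)

# Python 3.7+ only: split on the empty \b matches, then drop blank pieces.
pieces = [p.strip() for p in re.split(r"\b", sentence) if p.strip()]
print(pieces)
```

Both approaches give the same list for this sentence; the findall version is the safer one if the code has to run on older interpreters.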
import re
split = re.findall(r"[\w']+|[.,!?;]", "How are you?")
print(split)
Output:
['How', 'are', 'you', '?']
Regex Explanation:
"[\w']+|[.,!?;]"
1st Alternative: [\w']+
[\w']+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\w match any word character [a-zA-Z0-9_] (in Python 3 this also covers Unicode letters and digits unless re.ASCII is set)
' the literal character '
2nd Alternative: [.,!?;]
[.,!?;] match a single character present in the list below
.,!?; a single character in the list .,!?; literally
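To see what the apostrophe in the first character class buys you, here is a small sketch using the same pattern on a sentence with contractions. The tokenize helper and the sample sentence are hypothetical, added only for illustration:

```python
import re

# Same pattern as above: word characters and apostrophes form one token,
# each listed punctuation mark becomes its own token.
TOKEN_RE = re.compile(r"[\w']+|[.,!?;]")


def tokenize(text):
    """Return the words and listed punctuation marks found in text."""
    return TOKEN_RE.findall(text)


print(tokenize("Don't panic, it's fine!"))
```

Keep in mind that any character outside both classes (a colon or a hyphen, for example) is silently dropped, so extend the second class if you need those as tokens.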
Here is my approach to split on word boundaries:
re.split(r"\b\W\b", "How are you?") # Reprocess list to split on special characters.
# Result: ['How', 'are', 'you?']
and using findall on word boundaries:
re.findall(r"\b\w+\b", "How are you?")
# Result: ['How', 'are', 'you']
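The "reprocess list" step mentioned in the comment is not spelled out, so here is one possible way to do it. This is only a sketch: split_words is a hypothetical helper, and the second pattern (\w+|[^\w\s]) is my assumption about what "split on special characters" should mean.

```python
import re


def split_words(text):
    """Split on \\b\\W\\b, then separate leftover punctuation from each chunk."""
    # Step 1: split on a single non-word character that sits between two words.
    chunks = re.split(r"\b\W\b", text)
    # Step 2: break each chunk into word runs and individual punctuation marks.
    tokens = []
    for chunk in chunks:
        tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens


print(split_words("How are you?"))
```

Since the second step already does all the work on its own, the two-pass version is mostly useful if you also need the intermediate chunks.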