Split string at every position where an upper-case word starts
Question:
What is the best way to split a string like "HELLO there HOW are YOU"
by upper-case words?
So I'd end up with a list like this: results = ['HELLO there', 'HOW are', 'YOU']
I have tried:
p = re.compile("\b[A-Z]{2,}\b")
print p.split(page_text)
It doesn’t seem to work, though.
Answers:
You could use a lookahead:
re.split(r'[ ](?=[A-Z]+\b)', input)
This splits at every space that is followed by a run of upper-case letters ending at a word boundary.
Note that the square brackets around the space are only there for readability and could just as well be omitted.
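As a quick check, here is a minimal sketch of the lookahead split applied to the sample string from the question. It assumes Python 3 and uses the variable name text (not from the original posts) instead of input, which would shadow a built-in.

```python
import re

text = "HELLO there HOW are YOU"  # sample string from the question

# Split at each space that is followed by an all-caps word.
# The lookahead (?=...) only inspects what follows without consuming it,
# so the upper-case word stays at the start of the next chunk.
parts = re.split(r' (?=[A-Z]+\b)', text)
print(parts)  # ['HELLO there', 'HOW are', 'YOU']

# The bracketed form from the answer behaves identically.
bracketed = re.split(r'[ ](?=[A-Z]+\b)', text)
print(bracketed == parts)  # True
```

Only the space itself is consumed by the split, which is why no upper-case word is lost.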
If it is enough that the first letter of a word is upper case (so if you also want to split in front of Hello), it gets even easier:
re.split(r'[ ](?=[A-Z])', input)
Now this splits at every space followed by any upper-case letter.
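To see how the two patterns differ, the sketch below runs both on a string that also contains a capitalized word. The sample string is made up for illustration and is not from the question; Python 3 is assumed.

```python
import re

# Hypothetical input mixing all-caps words with a capitalized one.
text = "HELLO there Hello world HOW are YOU"

# Stricter pattern: the whole next word must be upper case.
strict = re.split(r' (?=[A-Z]+\b)', text)
print(strict)  # ['HELLO there Hello world', 'HOW are', 'YOU']

# Looser pattern: only the first letter must be upper case.
loose = re.split(r' (?=[A-Z])', text)
print(loose)   # ['HELLO there', 'Hello world', 'HOW are', 'YOU']
```

With the stricter pattern, "Hello" does not trigger a split because no word boundary follows its leading "H".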
Your question contains the string literal "\b[A-Z]{2,}\b", but in a normal string \b means backspace, because there is no r prefix to make it a raw string.
Try: r"\b[A-Z]{2,}\b".
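The difference between the two literals can be checked directly. This is a small sketch assuming Python 3, with text as an illustrative variable name. It also shows that the raw-string pattern, once it matches, consumes the upper-case words when used with split, so a lookahead is still needed to keep them in the result.

```python
import re

text = "HELLO there HOW are YOU"  # sample string from the question

# Without the r prefix, "\b" is a single backspace character (0x08).
print("\b" == "\x08")  # True
# With the r prefix, it is a backslash followed by 'b', which the regex
# engine interprets as a word boundary.
print(len(r"\b"))      # 2

# The non-raw pattern looks for literal backspaces, finds none,
# and returns the string unsplit.
no_raw = re.split("\b[A-Z]{2,}\b", text)
print(no_raw)  # ['HELLO there HOW are YOU']

# The raw pattern matches the upper-case words, but split() removes them.
raw = re.split(r"\b[A-Z]{2,}\b", text)
print(raw)     # ['', ' there ', ' are ', '']
```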