Split string at every position where an upper-case word starts

Question:

What is the best way to split a string like "HELLO there HOW are YOU" by upper-case words?

So I’d end up with a list like this: results = ['HELLO there', 'HOW are', 'YOU']

I have tried:

p = re.compile("\b[A-Z]{2,}\b")
print p.split(page_text)

It doesn’t seem to work, though.

Asked By: user179169


Answers:

You could use a lookahead:

re.split(r'[ ](?=[A-Z]+\b)', input)

This will split at every space that is followed by a run of upper-case letters ending at a word boundary.

Note that the square brackets are only for readability and could as well be omitted.
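As a quick check, here is a minimal sketch (Python 3) that applies the lookahead split to the string from the question, once with and once without the square brackets. The variable names are only illustrative.

```python
import re

text = "HELLO there HOW are YOU"

# Split at each space that is followed by an all-caps word.
with_brackets = re.split(r'[ ](?=[A-Z]+\b)', text)

# Same pattern with a bare space instead of the [ ] character class.
without_brackets = re.split(r' (?=[A-Z]+\b)', text)

print(with_brackets)     # ['HELLO there', 'HOW are', 'YOU']
print(without_brackets)  # same result
```

Because the lookahead consumes no characters, the upper-case words stay in the output; only the space itself is removed.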

If it is enough that only the first letter of a word is upper case (that is, if you want to split in front of Hello as well), it gets even easier:

re.split(r'[ ](?=[A-Z])', input)

Now this splits at every space followed by any upper-case letter.
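To make the difference between the two patterns concrete, here is a small sketch (Python 3) run on a made-up string that contains both a capitalized word and an all-caps word. The sample string and variable names are assumptions for illustration.

```python
import re

mixed = "say Hello to YOU"

# Strict: the whole word after the space must be upper case.
strict = re.split(r' (?=[A-Z]+\b)', mixed)

# Loose: only the first letter after the space must be upper case.
loose = re.split(r' (?=[A-Z])', mixed)

print(strict)  # ['say Hello to', 'YOU']
print(loose)   # ['say', 'Hello to', 'YOU']
```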

Answered By: Martin Ender

I suggest

l = re.compile(r"(?<!^)\s+(?=[A-Z])(?!.\s)").split(s)
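For reference, here is a short sketch (Python 3) of how this pattern behaves, as far as I can tell from reading it: it splits on whitespace that is not at the start of the string and is followed by an upper-case letter, but the trailing (?!.\s) skips words that are a single character followed by whitespace, such as "I". The second sample string is made up for illustration.

```python
import re

pattern = re.compile(r"(?<!^)\s+(?=[A-Z])(?!.\s)")

# The string from the question.
plain = pattern.split("HELLO there HOW are YOU")

# A one-letter upper-case word ("I") does not trigger a split.
with_single_letter = pattern.split("HELLO there I am HOW are YOU")

print(plain)               # ['HELLO there', 'HOW are', 'YOU']
print(with_single_letter)  # ['HELLO there I am', 'HOW are', 'YOU']
```

Note that (?=[A-Z]) only checks the first letter, so this pattern would also split in front of a capitalized word like Hello.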

Check this demo.

Answered By: Ωmega

Your question contains the string literal "\b[A-Z]{2,}\b",
but without the r prefix (a raw string), Python interprets that \b as a backspace character rather than a word boundary.

Try: r"\b[A-Z]{2,}\b".
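The effect is easy to verify. The sketch below (Python 3, illustrative variable names) compares the two literals on the string from the question.

```python
import re

text = "HELLO there HOW are YOU"

# In a normal string literal, "\b" is the single backspace character.
backspace_is_one_char = ("\b" == "\x08")

non_raw = re.compile("\b[A-Z]{2,}\b")   # looks for literal backspaces
raw = re.compile(r"\b[A-Z]{2,}\b")      # looks for word boundaries

no_split = non_raw.split(text)   # nothing matches, so the text is unchanged
words = raw.findall(text)        # the upper-case words are found
pieces = raw.split(text)         # but split() removes the matched words

print(no_split)  # ['HELLO there HOW are YOU']
print(words)     # ['HELLO', 'HOW', 'YOU']
print(pieces)    # ['', ' there ', ' are ', '']
```

So the raw string makes the pattern match, but splitting on the words themselves discards them; keeping them in the result still needs a lookahead such as the one in the first answer.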

Answered By: druid62