Splitting a text at a capital letter that follows a small letter, without losing the small letter

Question:

I have the following type of strings:
"CanadaUnited States",
"GermanyEnglandSpain"

I want to split them into the countries’ names, i.e.:

['Canada', 'United States']
['Germany', 'England', 'Spain']

I have tried using the following regex:

import re

text = "GermanyEnglandSpain"
re.split(r'[a-z](?=[A-Z])', text)

and I’m getting:
['German', 'Englan', 'Spain']

How can I avoid losing the last character of every word?
Thanks!

Asked By: EyalG

||

Answers:

I would use re.findall here, matching each country name directly instead of splitting:

import re

inp = "CanadaUnited States"
countries = re.findall(r'[A-Z][a-z]+(?: [A-Z][a-z]+)*', inp)
print(countries)  # ['Canada', 'United States']

The regex pattern used here says to match:

  • [A-Z][a-z]+ match the leading capitalized word of a country name
  • (?: [A-Z][a-z]+)* followed by a space and another capitalized word, zero or more times
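
As a minimal sketch, the pattern described above can be applied to both sample strings from the question. The helper name split_countries and the constant COUNTRY_PATTERN are hypothetical, introduced only for illustration, and the pattern assumes plain ASCII letters in the names.

```python
import re

# Hypothetical names for illustration; the pattern is the one described above.
COUNTRY_PATTERN = re.compile(r'[A-Z][a-z]+(?: [A-Z][a-z]+)*')

def split_countries(text):
    """Return every capitalized name, keeping space-separated words together."""
    return COUNTRY_PATTERN.findall(text)

print(split_countries("CanadaUnited States"))  # ['Canada', 'United States']
print(split_countries("GermanyEnglandSpain"))  # ['Germany', 'England', 'Spain']
```

Compiling the pattern once is only a convenience when many strings are processed; calling re.findall directly behaves the same.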
Answered By: Tim Biegeleisen

You can use re.split with a capture group like so, but then you will also need to filter out the empty strings that are left between the matches:

import re

text = "GermanyEnglandSpain"
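res = re.split(r'([A-Z][a-z]*)', text)  # the capture group keeps the matched words in the result
res = list(filter(None, res))  # drop the empty strings between the matches
print(res)  # ['Germany', 'England', 'Spain']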
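
A different way to stay with re.split, sketched here under the assumption of Python 3.7 or later (where re.split accepts patterns that can match an empty string), is to split on the zero-width position between a lowercase and an uppercase letter. This is a swapped-in technique rather than the approach shown above, and the function name split_on_case_boundary is hypothetical.

```python
import re

def split_on_case_boundary(text):
    # The lookbehind and lookahead consume no characters, so the
    # lowercase letter before the boundary is kept in the result.
    return re.split(r'(?<=[a-z])(?=[A-Z])', text)

print(split_on_case_boundary("GermanyEnglandSpain"))  # ['Germany', 'England', 'Spain']
print(split_on_case_boundary("CanadaUnited States"))  # ['Canada', 'United States']
```

Because nothing is consumed at the split point, no filtering step is needed, and a space between two words is not treated as a boundary.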
Answered By: frankenapps

My answer is longer than Tim’s because I wanted to cover more cases of the problem, so that you can adapt it as you need. You can shorten it by using lambda functions and combining several regexes into one.

Basic flow: add a space before every uppercase letter, replace runs of two or more spaces with *, split on single spaces, and finally replace each * with a single space.

import re

text = "GermanyUnited  StatesEnglandUnited StatesSpain"
text2 = re.sub(r'([A-Z])', r' \1', text)  # add a single space before every uppercase letter
print(text2)
# ' Germany United   States England United  States Spain' (note the leading space)
text3 = re.sub(r'\s{2,}', '*', text2)  # replace runs of 2 or more spaces with * so they can be restored later
print(text3)
# ' Germany United*States England United*States Spain'
text4 = re.split(' ', text3)  # split the text into a list on every single space
print(text4)
# ['', 'Germany', 'United*States', 'England', 'United*States', 'Spain']
text5 = []

for i in text4:
    text5.append(re.sub(r'\*', ' ', i))  # replace every * with a single space
text5 = list(filter(None, text5))  # remove empty elements

print(text5)
# ['Germany', 'United States', 'England', 'United States', 'Spain']
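
The same flow can be condensed into one function, as a rough sketch. The name split_countries_compact is hypothetical, and the sketch assumes the input never contains a literal *, since that character serves as the temporary placeholder.

```python
import re

def split_countries_compact(text):
    # Step 1: put a space before every uppercase letter.
    spaced = re.sub(r'([A-Z])', r' \1', text)
    # Step 2: collapse runs of 2+ whitespace characters into '*', then split on single spaces.
    parts = re.sub(r'\s{2,}', '*', spaced).split(' ')
    # Step 3: restore the spaces inside names and drop empty elements.
    return [part.replace('*', ' ') for part in parts if part]

print(split_countries_compact("GermanyUnited  StatesEnglandUnited StatesSpain"))
# ['Germany', 'United States', 'England', 'United States', 'Spain']
```

Here str.split and str.replace stand in for re.split and the last re.sub, because those two steps work on fixed characters and need no regex.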
Answered By: Mehmet N