How do I extract location names from a string with mixed commas and quotation marks? (using Regex or any other methods)
Question:
I have a string of locations
locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore'
Note that the location names are separated by commas, but any name that itself contains commas is enclosed in double quotation marks. There is also leading/trailing whitespace to be stripped.
After extracting the names into a list, the result should be:
['Los Angeles California', 'Heliopolis, Central, Cairo, Egypt', 'Berlin Germany', 'Paris France', 'Cairo, Egypt', 'Dokki, Giza, Egypt', 'Singapore']
I have tried this, and it gets the right result, but I’m laughing at my work because it looks so cumbersome:
import re
locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore'
lis1 = [e.strip() for e in re.findall('"(.*?)"', locations)]
temp = []
for strg in lis1:
    temp.extend([x.strip() for x in strg.split(',')])
lis2 = [e.strip() for e in locations.split(',')]
for strg in lis2:
    if strg.strip('"').strip() not in temp:
        lis1.append(strg)
print(lis1)
So I’m reaching out to the community… Is there a better solution using Regex or any other methods?
Answers:
locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore'
locations = locations.strip(',')
locations=locations.split('"')
result=[]
for i in locations:
    i = i.strip()
    i = i.rstrip(',')
    i = i.lstrip(',')
    if i=="":
        continue
    else:
        result.append(i)
print([e.strip() for e in result])
Output
['Los Angeles California',
'Heliopolis, Central, Cairo, Egypt',
'Berlin Germany, Paris France',
'Cairo, Egypt',
'Dokki, Giza, Egypt',
'Singapore']
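Note that 'Berlin Germany, Paris France' stays as a single item in the output above, because the pieces outside the quotes are never split on commas. Here is a minimal sketch of one way to handle that. It assumes the quotes are always balanced, so every odd-numbered piece of split('"') is a quoted name; the variable names are only illustrative.

```python
locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore'

result = []
for index, piece in enumerate(locations.split('"')):
    if index % 2 == 1:
        # Odd pieces were inside quotes: keep their commas.
        result.append(piece.strip())
    else:
        # Even pieces were outside quotes: split them on commas
        # and drop the empty leftovers around the quoted names.
        result.extend(p.strip() for p in piece.split(',') if p.strip())

print(result)
```

This keeps the names in their original order, but it would misbehave on a string with an unmatched quotation mark.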
[l.strip() for l in locations.split(",")]
Try this (this doesn’t use regex)
locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore'
in_string = False
out = ['']
for char in locations:
    if char == '"':
        in_string = not in_string
        continue
    if char == ',':
        if not in_string:
            out.append('')
            continue
    out[-1] += char
print([x.strip() for x in out])
Output:
['Los Angeles California',
'Heliopolis, Central, Cairo, Egypt',
'Berlin Germany',
'Paris France',
'Cairo, Egypt',
'Dokki, Giza, Egypt',
'Singapore']
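The loop above is roughly what a CSV parser does, so the standard library csv module may also be worth a try. This is only a sketch: it relies on skipinitialspace=True so that a space before an opening quote does not stop the field from being treated as quoted, and it still strips each field afterwards.

```python
import csv

locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore'

# csv.reader expects an iterable of lines, so wrap the string in a list.
# skipinitialspace=True ignores spaces that directly follow a delimiter.
row = next(csv.reader([locations], skipinitialspace=True))

# Remove the remaining leading/trailing whitespace from every field.
result = [field.strip() for field in row]
print(result)
```

The benefit is that the quoting rules are handled by well-tested code rather than a hand-written flag.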
I tried to get an answer in a single line in JavaScript. Here is another possible solution:
JavaScript:
locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore, "Kolkata, India", Nepal, Bhutan';
locations.replace(/"[\w\s, ]+"/gi, x => x.replace(/,/g, '\\').replace(/"/g, '').trim()).split(',').map(x => x.replace(/\\/g, ',').trim())
Output:
[
"Los Angeles California",
"Heliopolis, Central, Cairo, Egypt",
"Berlin Germany",
"Paris France",
"Cairo, Egypt",
"Dokki, Giza, Egypt",
"Singapore",
"Kolkata, India",
"Nepal",
"Bhutan"
]
Explanation:
- Find each run of text between " (double inverted commas).
- Then replace all commas (,) inside it with a backslash (\). I am using a backslash because it is not generally used in location names.
- Remove the " (double inverted commas).
- Now split the string on comma (,) and replace each backslash (\) with a comma (,).
In Python:
import re
locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore, "Kolkata, India", Nepal, Bhutan'
l = [e for e in re.sub(r'"[\w\s, ]+"', 'xxxxx', locations).split(',') if 'xxxxx' not in e] + re.findall('"(.*?)"', locations)
print([e.strip() for e in l])
Output:
['Los Angeles California',
'Berlin Germany',
'Paris France',
'Singapore',
'Nepal',
'Bhutan',
'Heliopolis, Central, Cairo, Egypt',
'Cairo, Egypt',
'Dokki, Giza, Egypt',
'Kolkata, India']
Here’s another way to solve it
import re
locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore'
lis1 = [e.strip() for e in re.findall('"(.*?)"', locations)]
templis = ''.join(re.split('".*?"', locations))
lis2 = [e.strip() for e in templis.split(',') if len(e.strip()) > 0]
print(lis1 + lis2)
Output:
['Heliopolis, Central, Cairo, Egypt',
'Cairo, Egypt',
'Dokki, Giza, Egypt',
'Los Angeles California',
'Berlin Germany',
'Paris France',
'Singapore']
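The result above lists the quoted names first, so the original order is lost. If order matters, a single re.findall with two alternatives might do it. This is a sketch that assumes quoted names never contain a double quote themselves; the group names quoted and bare are just for illustration.

```python
import re

locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore'

# First alternative: text inside double quotes (commas allowed).
# Second alternative: a run of characters with no comma and no quote.
matches = re.findall(r'"([^"]*)"|([^",]+)', locations)

# Each match is a (quoted, bare) tuple with exactly one non-empty side.
names = [(quoted or bare).strip() for quoted, bare in matches]

# Whitespace-only runs between a closing quote and a comma become '' here.
result = [name for name in names if name]
print(result)
```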
Today I retried and finally got an answer in a single line.
In JavaScript:
locations = `Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore, "Kolkata, India", Nepal, Bhutan`;
locations.replace(/"[\w\s, ]+"/gi, x => x.replace(/,/g, '\\').replace(/"/g, '').trim()).split(',').map(x => x.replace(/\\/g, ',').trim())
Output:
[
"Los Angeles California",
"Heliopolis, Central, Cairo, Egypt",
"Berlin Germany",
"Paris France",
"Cairo, Egypt",
"Dokki, Giza, Egypt",
"Singapore",
"Kolkata, India",
"Nepal",
"Bhutan"
]
Explanation:
- Find each run of text between " (double inverted commas).
- Then replace all commas (,) inside it with a backslash (\). I am using a backslash because it is not generally used in location names.
- Remove the " (double inverted commas).
- Now split the string on comma (,) and replace each backslash (\) with a comma (,).
I am not able to write that in Python:
str.replace(find_st, x => x.replace(find_st1, rep_st))
I don’t know how to express the above expression in Python, specifically the inner function.
Can anyone help write the above regular expression in Python in a single line?
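For the inner function: Python's re.sub accepts a callable as the replacement argument. It is called with each match object, and its return value is used as the replacement text, so a lambda can play the role of the JavaScript arrow function. Below is a sketch along those lines. Like the JavaScript version, it assumes quoted names contain only word characters, whitespace and commas, and that no location contains a backslash.

```python
import re

locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore, "Kolkata, India", Nepal, Bhutan'

# The lambda receives a match object; m.group(0) is the whole quoted text.
# Inside it, commas become backslashes and the quotes are dropped, so the
# later split(',') only sees the separating commas.
result = [x.replace('\\', ',').strip() for x in re.sub(r'"[\w\s, ]+"', lambda m: m.group(0).replace(',', '\\').replace('"', ''), locations).split(',')]

print(result)
```

It is a single expression, although splitting it over a few lines would probably be easier to read.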