How do I extract location names from a string with mixed commas and quotation marks? (using Regex or any other methods)

Question:

I have a string of locations:

locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore'

Note that the location names are separated by commas, but any name that itself contains commas is enclosed in double quotation marks. Leading and trailing whitespace also needs to be stripped.

After extracting the names into a list, the result should be:

['Los Angeles California', 'Heliopolis, Central, Cairo, Egypt', 'Berlin Germany', 'Paris France', 'Cairo, Egypt', 'Dokki, Giza, Egypt', 'Singapore']

I tried the following, and it does produce the right result, but it looks cumbersome:

import re

locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore'
lis1 = [e.strip() for e in re.findall('"(.*?)"', locations)]
temp = []
for strg in lis1:
    temp.extend([x.strip() for x in strg.split(',')])
lis2 = [e.strip() for e in locations.split(',')]
for strg in lis2:
    if strg.strip('"').strip() not in temp:
        lis1.append(strg)
print(lis1)

So I’m reaching out to the community… Is there a better solution using Regex or any other methods?

Asked By: perpetualstudent


Answers:

locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore'
parts = locations.split('"')

result = []
for idx, part in enumerate(parts):
    if idx % 2:
        # odd-indexed pieces were inside quotes: keep their commas
        result.append(part.strip())
    else:
        # even-indexed pieces were outside quotes: split them on commas
        result.extend(p.strip() for p in part.split(',') if p.strip())

print(result)

Output

['Los Angeles California',
 'Heliopolis, Central, Cairo, Egypt',
 'Berlin Germany',
 'Paris France',
 'Cairo, Egypt',
 'Dokki, Giza, Egypt',
 'Singapore']
Answered By: Mehmaam

[l.strip() for l in locations.split(",")]
Answered By: Sachin Salve

Try this (this doesn’t use regex)

locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore'

in_string = False
out = ['']

for char in locations:
    if char == '"':
        in_string = not in_string
        continue
    if char == ',':
        if not in_string:
            out.append('')
            continue
    out[-1] += char

print([x.strip() for x in out])

Output:

['Los Angeles California',
 'Heliopolis, Central, Cairo, Egypt',
 'Berlin Germany',
 'Paris France',
 'Cairo, Egypt',
 'Dokki, Giza, Egypt',
 'Singapore']
Answered By: The Thonnu
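
The character loop above is essentially what a CSV parser does, so the standard library's csv module may be worth a try as well. This is only a sketch under the assumption that the input behaves like a single CSV row; the names reader and names are illustrative. skipinitialspace=True is needed because some opening quotes come after a space rather than directly after the comma.

```python
import csv

locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore'

# Treat the whole string as one CSV row.
# skipinitialspace=True ignores spaces right after a delimiter, so a quote
# that follows ", " still opens a quoted field.
reader = csv.reader([locations], skipinitialspace=True)

# Strip the leftover whitespace around each field.
names = [field.strip() for field in next(reader)]
print(names)
```

The original order is kept, and the quotation marks are removed by the parser itself.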

I tried to get an answer in a single line in JavaScript. Here is another possible solution:

JavaScript:

locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore, "Kolkata, India", Nepal, Bhutan';

locations.replace(/"[\w\s, ]+"/gi, x => x.replace(/,/g, '\\').replace(/"/g, '').trim()).split(',').map(x => x.replace(/\\/g, ',').trim())

Output:

[
  "Los Angeles California", 
   "Heliopolis, Central, Cairo, Egypt", 
   "Berlin Germany", 
   "Paris France", 
   "Cairo, Egypt", 
   "Dokki, Giza, Egypt", 
   "Singapore", 
   "Kolkata, India", 
   "Nepal", 
   "Bhutan"
] 

Explanation:

  • Find each run of text between " (double quotation marks).
    • Then replace all commas (,) in it with a backslash (\). I am using a backslash because it is generally not used in location names.
    • Remove the " (double quotation marks).
  • Now split the string on commas (,) and replace each backslash (\) with a comma (,).

In Python:

import re

locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore, "Kolkata, India", Nepal, Bhutan'

l = [e for e in re.sub(r'"[\w\s, ]+"', 'xxxxx', locations).split(',') if 'xxxxx' not in e] + re.findall('"(.*?)"', locations)
print([e.strip() for e in l])

Output:

['Los Angeles California',
 'Berlin Germany',
 'Paris France',
 'Singapore',
 'Nepal',
 'Bhutan',
 'Heliopolis, Central, Cairo, Egypt',
 'Cairo, Egypt',
 'Dokki, Giza, Egypt',
 'Kolkata, India']
Answered By: Art Bindu

Here’s another way to solve it

import re 

locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore'
lis1 = [e.strip() for e in re.findall('"(.*?)"', locations)]
templis = ''.join(re.split('".*?"', locations))
lis2 = [e.strip() for e in templis.split(',') if len(e.strip()) > 0]

print(lis1 + lis2)

Output:

['Heliopolis, Central, Cairo, Egypt',
 'Cairo, Egypt',
 'Dokki, Giza, Egypt',
 'Los Angeles California',
 'Berlin Germany',
 'Paris France',
 'Singapore']
Answered By: Gold79
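
The two-pass approach above returns the quoted names first, so the original order is lost. If order matters, one possible variation is a single re.findall with an alternation: either a quoted run or a run without commas and quotes. This is a sketch, not a general CSV parser; it assumes quotes are always balanced and never nested, and the names pattern and names are illustrative.

```python
import re

locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore'

# Group 1: text inside double quotes (commas allowed).
# Group 2: a run of characters containing neither commas nor quotes.
pattern = re.compile(r'"([^"]*)"|([^,"]+)')

# Each match fills exactly one of the two groups; the other is ''.
names = [(quoted or bare).strip() for quoted, bare in pattern.findall(locations)]

# Whitespace-only runs between a closing quote and the next comma become '',
# so drop the empty entries.
names = [name for name in names if name]
print(names)
```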

I retried today and finally got an answer in a single line.

In JavaScript:

locations = `Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore, "Kolkata, India", Nepal, Bhutan`;

locations.replace(/"[\w\s, ]+"/gi, x => x.replace(/,/g, '\\').replace(/"/g, '').trim()).split(',').map(x => x.replace(/\\/g, ',').trim())

Output:

[
  "Los Angeles California", 
   "Heliopolis, Central, Cairo, Egypt", 
   "Berlin Germany", 
   "Paris France", 
   "Cairo, Egypt", 
   "Dokki, Giza, Egypt", 
   "Singapore", 
   "Kolkata, India", 
   "Nepal", 
   "Bhutan"
] 

Explanation:

  • Find each run of text between " (double quotation marks).
    • Then replace all commas (,) in it with a backslash (\). I am using a backslash because it is generally not used in location names.
    • Remove the " (double quotation marks).
  • Now split the string on commas (,) and replace each backslash (\) with a comma (,).

I am not able to write this part in Python:

str.replace(find_st, x => x.replace(find_st1, rep_st))

I don’t know how to express the above in Python, specifically the inner function.

Can anyone help me write the above regular expression in Python in a single line?

Answered By: Art Bindu
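
On the Python side, re.sub accepts a function as the replacement argument, which plays the same role as the inner arrow function in the JavaScript version. A minimal sketch, assuming the same backslash placeholder and the same character class; the name names is illustrative, and the expression is spread over several lines only for readability.

```python
import re

locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore, "Kolkata, India", Nepal, Bhutan'

names = [
    # Step 3: turn the placeholder backslashes back into commas.
    part.replace('\\', ',').strip()
    for part in re.sub(
        r'"[\w\s, ]+"',
        # Step 1: inside each quoted run, hide commas behind a backslash
        # and drop the quotes. The return value of a function replacement
        # is inserted literally, so the backslash needs no extra escaping.
        lambda m: m.group(0).replace(',', '\\').replace('"', ''),
        locations,
    ).split(',')  # Step 2: the only commas left are real separators.
]
print(names)
```

Like the JavaScript original, this relies on the input never containing a backslash of its own.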