Getting x, y coordinates from a Pandas dataframe in which there are multiple location formats

Question:

I am attempting to write code that iterates through the rows of a Pandas dataframe and adds a row to a list when the value in a specific column is a lat, long coordinate pair. The column in question contains locations in several formats, but we only need the lat, long coordinates. Here is a brief preview of the column and the variation in formats:

Locations
12010 HWY 61
39.643114,‐104.716489
(40.611712, ‐103.234619), Sterling, CO 80751
39.111393, ‐108.410419
40°29’59.8"N, 104°37’14.9"W

We only need the items in the dataframe that are in the format (x, y), for example (39.643114,‐104.716489), meaning that addresses and items in degrees, minutes, seconds are not needed.

My thinking so far has been to create a new list, define the conditions that we are vetting for, and then create a loop or function that would append these rows to the new list:

import re

xy = []
coords = r'^-?\d+\.\d+,\s-?\d+\.\d+$'

for i, row in csg.iterrows():
    if re.match(coords, row["Locations"]):
        xy.append(row)

where "csg" is the name of the dataframe I’m working out of.
For some reason this code block is not picking up any of the lat, long coordinates. Regarding the conditions:

coords = r'^-?\d+\.\d+,\s-?\d+\.\d+$'

I’ve tried to make it inclusive of values that may be positive or negative, independent of variances in length, and inclusive of values that may or may not be separated by a space after the comma.

If possible, I would also like to make the eventual code able to pull lat, long coordinates from items in the dataframe that contain the lat, long coordinates as part of a longer location, such as item 3 in the provided example column.

Asked By: cascad

Answers:

Your pattern fails to match any of the coordinates because it looks for a hyphen-minus (-, U+002D) while some of the rows in your DataFrame hold a hyphen (‐, U+2010). One more point: even with that handled, your pattern won’t match all the valid lat/lon coordinates (e.g. the 2nd row, which has no space after the comma, and the 3rd row, where the coordinates sit inside a longer string).
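
To see the mismatch directly, here is a minimal sketch that names the sign character in a value and tests it against the question's pattern. The pattern is written with its regex escapes (`\d`, `\.`, `\s`), and `describe` is a hypothetical helper introduced only for this illustration.

```python
import re
import unicodedata

# The question's pattern: only U+002D (HYPHEN-MINUS) is accepted as a sign.
coords = r"^-?\d+\.\d+,\s-?\d+\.\d+$"


def describe(value):
    """Return the Unicode name of the longitude's sign and whether the pattern matches."""
    sign = value.split(",")[1].strip()[0]
    return unicodedata.name(sign), bool(re.match(coords, value))


# Same coordinates twice: first with a hyphen-minus, then with U+2010 (HYPHEN).
for value in ["39.111393, -108.410419", "39.111393, \u2010108.410419"]:
    print(describe(value))
```

The two strings look identical on screen, but only the one typed with a hyphen-minus matches.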

Try this one:

pat = r"(?P<Latitude>[-‐\d.]+),\s*(?P<Longitude>[-‐\d.]+)"

out = csg.join(csg["Locations"].str.extract(pat))

Demo : [Regex101]

Output:

print(out)

                                      Locations   Latitude    Longitude
0                                  12010 HWY 61        NaN          NaN
1                         39.643114,‐104.716489  39.643114  ‐104.716489
2  (40.611712, ‐103.234619), Sterling, CO 80751  40.611712  ‐103.234619
3                        39.111393, ‐108.410419  39.111393  ‐108.410419
4                   40°29'59.8"N, 104°37'14.9"W        NaN          NaN

If you need a list:

l = csg["Locations"].str.extract(pat).dropna().to_numpy().tolist()

[['39.643114', '‐104.716489'],
 ['40.611712', '‐103.234619'],
 ['39.111393', '‐108.410419']]
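
The extracted values are still strings, and the longitudes keep the U+2010 hyphen, which `float()` does not accept as a sign. If numeric coordinates are needed, one possible follow-up (not part of the approach above) is to swap that character for a hyphen-minus before converting. The names `HYPHEN` and `to_float` are assumptions made for this sketch.

```python
HYPHEN = "\u2010"  # the hyphen found in the data, not the ASCII hyphen-minus

# The list produced above, written with explicit escapes for the hyphen.
l = [
    ["39.643114", "\u2010104.716489"],
    ["40.611712", "\u2010103.234619"],
    ["39.111393", "\u2010108.410419"],
]


def to_float(text):
    """Replace U+2010 with '-' so that float() can parse the sign."""
    return float(text.replace(HYPHEN, "-"))


xy = [[to_float(lat), to_float(lon)] for lat, lon in l]
print(xy)
```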

Answered By: Timeless