With a dataframe that contains coordinates, find other rows with coordinates within n miles of each row

Question

I’ve been going around in circles and I can’t figure out the right way to do this.

Ok so I have several dataframes. Within each dataframe I have coordinates in column d['LATITUDE'] and d['LONGITUDE']. I want to figure out how many other rows have coordinates within n miles of each row. I created a few new columns to store the counts but actually filing them has me stumped. I’ve read its not good to iterate over rows in a dataframe, but I can’t figure out how to do this a better way. It seems to work but takes hours for each dataframe.

import numpy as np
import pandas as pd
from geopy.distance import geodesic

#example input
df = pd.DataFrame({'LATITUDE':[38.9547, 38.9404, 38.9032, 38.864, 38.8639, 38.9017, 38.947783, 38.8629, 38.9478, 38.9017, 38.9491, 38.8643, 38.8643, 38.9464, 38.903],
       'LONGITUDE':[-77.0463, -77.048, -77.1417, -77.156, -77.1631, -77.131, -77.03983, -77.1558, -77.0439, -77.131, -77.0461, -77.1539, -77.1539, -77.0385, -77.1411]})
    
    
d['1_mile'] = np.nan
d['5_miles'] = np.nan
d['10_miles'] = np.nan
    
        
for n, r in enumerate(d.itertuples(), 1):
        df = d.apply(lambda row: geodesic((d['LATITUDE'].iloc[n], d['LONGITUDE'].iloc[n]), (row['LATITUDE'], row['LONGITUDE'])).miles, axis=1)

        c = len(df[(df<=1) & (df>0)])
        d.at[n,'1_mile'] = c

        c = len(df[(df<=5) & (df>0)])
        d.at[n,'5_miles'] = c

        c = len(df[(df<=10) & (df>0)])
        d.at[n,'10_miles'] = c

#desired output
     LATITUDE  LONGITUDE  1_mile  5_miles  10_miles
0   38.954700  -77.04630     5.0      5.0      14.0
1   38.940400  -77.04800     5.0      5.0      14.0
2   38.903200  -77.14170     3.0      8.0      14.0
3   38.864000  -77.15600     4.0      8.0      14.0
4   38.863900  -77.16310     4.0      8.0      14.0
5   38.901700  -77.13100     2.0      7.0      13.0
6   38.947783  -77.03983     5.0      5.0      14.0
7   38.862900  -77.15580     4.0      8.0      14.0
8   38.947800  -77.04390     5.0      5.0      14.0
9   38.901700  -77.13100     2.0      7.0      13.0
10  38.949100  -77.04610     5.0      5.0      14.0
11  38.864300  -77.15390     3.0      7.0      13.0
12  38.864300  -77.15390     3.0      7.0      13.0
13  38.946400  -77.03850     5.0      5.0      14.0
14  38.903000  -77.14110     3.0      8.0      14.0

It works but I know it’s not the best way to do this. Can anyone please help me clean it up?

Thank you!

Asked By: amance

||

Source

Answer 1

It’s been a year without an answer so figured I should post a solution for anyone with a similar dilemma. Kudos to @Wightboy for suggesting the self join.

Through trial and error, I’ve learned that it’s immensely faster to run the coordinates through a function rather than iterate through each row.

import pandas as pd
import numpy as np

def haversine(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = np.radians([lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1

    haver_formula = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2

    r = 3958.756 #6371 for distance in KM for miles use 3958.756
    dist = 2 * r * np.arcsin(np.sqrt(haver_formula))
    return pd.Series(dist)

df = pd.DataFrame({'ID':list(range(15)),
       'LATITUDE':[38.9547, 38.9404, 38.9032, 38.864, 38.8639, 38.9017, 38.947783, 38.8629, 38.9478, 38.9017, 38.9491, 38.8643, 38.8643, 38.9464, 38.903],
       'LONGITUDE':[-77.0463, -77.048, -77.1417, -77.156, -77.1631, -77.131, -77.03983, -77.1558, -77.0439, -77.131, -77.0461, -77.1539, -77.1539, -77.0385, -77.1411]})

#self join table
df2 = pd.merge(df.assign(key=1),df.assign(key=1), on='key', suffixes=('', '_2')).drop('key', axis=1)

#drop duplicates
df2 = df2[df2['ID']!=df2['ID_2']].reset_index(drop=True)

#find distances
df2['dist'] = haversine(df2['LONGITUDE'], df2['LATITUDE'], df2['LONGITUDE_2'], df2['LATITUDE_2'])

for m in [1, 5, 10]:
    df = pd.merge(df, df2[['ID', 'dist']][df2['dist']<=m].groupby('ID').count().reset_index().rename(columns={'dist':str(m)+'_mile'}), how='left')


#output
    ID   LATITUDE  LONGITUDE  1_mile  5_mile  10_mile
0    0  38.954700  -77.04630       5       5       14
1    1  38.940400  -77.04800       5       5       14
2    2  38.903200  -77.14170       3       8       14
3    3  38.864000  -77.15600       4       8       14
4    4  38.863900  -77.16310       4       8       14
5    5  38.901700  -77.13100       3       8       14
6    6  38.947783  -77.03983       5       5       14
7    7  38.862900  -77.15580       4       8       14
8    8  38.947800  -77.04390       5       5       14
9    9  38.901700  -77.13100       3       8       14
10  10  38.949100  -77.04610       5       5       14
11  11  38.864300  -77.15390       4       8       14
12  12  38.864300  -77.15390       4       8       14
13  13  38.946400  -77.03850       5       5       14
14  14  38.903000  -77.14110       3       8       14

Answered By: amance

With a dataframe that contains coordinates, find other rows with coordinates within n miles of each row

Question:

Answers: