Python – Distance matrix between geographic coordinates
Question:
I have a pandas DataFrame with over 600 geographic coordinate points. An extract from it follows:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from math import sin, cos, sqrt, atan2, radians
lat_long = pd.DataFrame({'LATITUDE':[-22.98, -22.97, -22.92, -22.87, -22.89], 'LONGITUDE': [-43.19, -43.39, -43.24, -43.28, -43.67]})
lat_long
To calculate the distance between two points manually, I use the code below:
lat1 = radians(lat_long['LATITUDE'][0])
lon1 = radians(lat_long['LONGITUDE'][0])
lat2 = radians(lat_long['LATITUDE'][1])
lon2 = radians(lat_long['LONGITUDE'][1])
R = 6373.0
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
c = 2 * atan2(sqrt(a), sqrt(1 - a))
distance = R * c
print("Result:", round(distance,4))
What I need is a function that uses the formula above to calculate the distance from every point to every other point, as a matrix. I am having trouble working out how to write that function and how to store the distances between the points. Any help is welcome. Example output (for illustrative purposes only, in case I have not been clear):
| |point0 | point1 | point2 |
|point0 | 0 | 2 | 3 |
|point1 | 2 | 0 | 4 |
|point2 | 3 | 4 | 0 |
|distance|distance|distance|
Answers:
You could use pdist to compute the pairwise distances:
import pandas as pd
import numpy as np
from math import sin, cos, sqrt, atan2, radians
from scipy.spatial.distance import pdist, squareform
lat_long = pd.DataFrame({'LATITUDE': [-22.98, -22.97, -22.92, -22.87, -22.89], 'LONGITUDE': [-43.19, -43.39, -43.24, -43.28, -43.67]})
def dist(x, y):
    """Function to compute the distance between two points x, y"""
    lat1 = radians(x[0])
    lon1 = radians(x[1])
    lat2 = radians(y[0])
    lon2 = radians(y[1])
    R = 6373.0
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    distance = R * c
    return round(distance, 4)
distances = pdist(lat_long.values, metric=dist)
points = [f'point_{i}' for i in range(1, len(lat_long) + 1)]
result = pd.DataFrame(squareform(distances), columns=points, index=points)
print(result)
Output
point_1 point_2 point_3 point_4 point_5
point_1 0.0000 20.5115 8.4123 15.3203 50.1784
point_2 20.5115 0.0000 16.3400 15.8341 30.0319
point_3 8.4123 16.3400 0.0000 6.9086 44.1838
point_4 15.3203 15.8341 6.9086 0.0000 40.0284
point_5 50.1784 30.0319 44.1838 40.0284 0.0000
Notice that squareform converts the condensed distance vector returned by pdist into a full square matrix, so the results are stored in a NumPy array.
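With several hundred points, calling a Python function for every pair can become slow. A minimal sketch of a vectorized alternative follows. It assumes NumPy broadcasting and the same haversine formula and radius (R = 6373.0) as above. The name haversine_matrix is hypothetical and not part of any library:

```python
import numpy as np

def haversine_matrix(lat_deg, lon_deg, radius_km=6373.0):
    """Return the full pairwise haversine distance matrix in kilometres."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    # Broadcasting a column against a row yields every pairwise difference.
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat[:, None]) * np.cos(lat[None, :]) * np.sin(dlon / 2) ** 2)
    # Clip guards against tiny floating-point excursions outside [0, 1].
    a = np.clip(a, 0.0, 1.0)
    return radius_km * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

latitudes = [-22.98, -22.97, -22.92, -22.87, -22.89]
longitudes = [-43.19, -43.39, -43.24, -43.28, -43.67]
matrix = haversine_matrix(latitudes, longitudes)
print(np.round(matrix, 4))
```

The result is already a square, symmetric array, so no squareform step is needed. It can be wrapped in pd.DataFrame with the same point labels if a labelled table is preferred.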
Another possible solution is to loop over each pair of rows once:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from math import sin, cos, sqrt, atan2, radians
lat_long = pd.DataFrame({'LATITUDE':[-22.98, -22.97, -22.92, -22.87, -22.89], 'LONGITUDE': [-43.19, -43.39, -43.24, -43.28, -43.67]})
lat_long
test = lat_long.iloc[2:,:]
def distance(city1, city2):
    lat1 = radians(city1['LATITUDE'])
    lon1 = radians(city1['LONGITUDE'])
    lat2 = radians(city2['LATITUDE'])
    lon2 = radians(city2['LONGITUDE'])
    R = 6373.0
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    distance = R * c
    return distance
dist = np.zeros([lat_long.shape[0],lat_long.shape[0]])
for i1, city1 in lat_long.iterrows():
    for i2, city2 in lat_long.iloc[i1+1:,:].iterrows():
        dist[i1,i2] = distance(city1, city2)
print(dist)
Output
[[ 0. 20.51149047 8.41230771 15.32026132 50.17836849]
[ 0. 0. 16.33997119 15.83407186 30.03192954]
[ 0. 0. 0. 6.90864606 44.18376436]
[ 0. 0. 0. 0. 40.02842872]
[ 0. 0. 0. 0. 0. ]]
The lower triangle of the distance matrix is left at zero because the matrix is symmetric (dist[i1,i2] == dist[i2,i1]), so only the upper triangle is computed.
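If the full symmetric matrix is needed, the upper triangle can be mirrored into the lower one. A small sketch, assuming an upper-triangular array with a zero diagonal like the one printed above (the values are copied from the first three rows and columns of that output, and the names upper and full are illustrative):

```python
import numpy as np

# Upper-triangular distances for the first three points, taken from the output above.
upper = np.array([
    [0.0, 20.51149047, 8.41230771],
    [0.0, 0.0, 16.33997119],
    [0.0, 0.0, 0.0],
])

# Adding the transpose fills the lower triangle; the diagonal stays zero.
full = upper + upper.T
print(full)
```

This only works because the diagonal is zero. With a non-zero diagonal, adding the transpose would double it.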
Using GeoPandas:
import pandas as pd
import geopandas as gpd
lat_long = pd.DataFrame({'LATITUDE':[-22.98, -22.97, -22.92, -22.87, -22.89], 'LONGITUDE': [-43.19, -43.39, -43.24, -43.28, -43.67]})
# Convert Pandas dataframe to GeoPandas dataframe
gdf = gpd.GeoDataFrame(
    lat_long,
    geometry=gpd.points_from_xy(lat_long['LONGITUDE'], lat_long['LATITUDE']),
    crs='EPSG:4326'  # Or change to what's appropriate for you.
)
# Calculate distances between points
distances = []
for _, row in gdf.iterrows():
    distances.append(gdf['geometry'].distance(row['geometry'])*100)
# Create data frame of distances
distances_df = pd.DataFrame.from_records(distances)
print(distances_df)
Output:
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0.000000 | 20.024984 | 7.810250 | 14.212670 | 48.836462 |
| 1 | 20.024984 | 0.000000 | 15.811388 | 14.866069 | 29.120440 |
| 2 | 7.810250 | 15.811388 | 0.000000 | 6.403124 | 43.104524 |
| 3 | 14.212670 | 14.866069 | 6.403124 | 0.000000 | 39.051248 |
| 4 | 48.836462 | 29.120440 | 43.104524 | 39.051248 | 0.000000 |
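Note that this output differs from the other answers because of the Coordinate Reference System (CRS). In EPSG:4326 the coordinates are in degrees, so distance() returns a planar distance in degrees, and multiplying by 100 is only a rough conversion to kilometres (one degree of latitude is about 111 km). For accurate results, reproject to a projected CRS appropriate for your region.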
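To see where the difference comes from, here is a small sketch for the first two points. It assumes, as the figures in the table suggest, that the distance in EPSG:4326 is a plain Euclidean distance in degrees, which the answer then multiplies by 100:

```python
from math import hypot

# First two points of the sample data (degrees).
lat1, lon1 = -22.98, -43.19
lat2, lon2 = -22.97, -43.39

# Straight-line distance in degree space, ignoring the Earth's curvature.
planar_degrees = hypot(lon2 - lon1, lat2 - lat1)
scaled = planar_degrees * 100
print(round(scaled, 6))
```

The value is about 20.025, matching the first off-diagonal entry in the GeoPandas table, whereas the haversine-based answers give about 20.5115 km for the same pair.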