Creating a Distance Matrix?

Question:

I am currently reading data into a dataframe that looks like this:

City         XCord    YCord   
Boston         5        2
Phoenix        7        3
New York       8        1
.....          .        .

I want to create a Euclidean distance matrix from this data showing the distance between all city pairs, so that I get a resulting matrix like:

             Boston    Phoenix   New York
Boston         0        2.236      3.162
Phoenix        2.236      0        2.236
New York       3.162    2.236        0

There are many more cities and coordinates in my actual data frame, so I need to iterate over all of the city pairs and build a distance matrix like the one shown above. However, I am not sure how to pair all of the cities together and apply the Euclidean distance formula. Any help would be appreciated.

Asked By: Jeremy


Answers:

I think you are interested in distance_matrix from scipy.spatial.

For example:

Create data:

import pandas as pd
from scipy.spatial import distance_matrix
    
data = [[5, 7], [7, 3], [8, 1]]
ctys = ['Boston', 'Phoenix', 'New York']
df = pd.DataFrame(data, columns=['xcord', 'ycord'], index=ctys)

Output:

          xcord ycord
Boston      5   7
Phoenix     7   3
New York    8   1

Using the distance matrix function:

pd.DataFrame(distance_matrix(df.values, df.values), index=df.index, columns=df.index)

Results:

          Boston    Phoenix     New York
Boston    0.000000  4.472136    6.708204
Phoenix   4.472136  0.000000    2.236068
New York  6.708204  2.236068    0.000000
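
If the data has the layout from the question, with the city names in a regular column rather than in the index, a minimal sketch could first move that column into the index. The column names City, XCord and YCord are taken from the question; the three sample rows are only illustrative, and the variable names coords and result are made up for this example.

```python
import pandas as pd
from scipy.spatial import distance_matrix

# Sample rows copied from the question; the real data frame has more cities.
df = pd.DataFrame({
    'City': ['Boston', 'Phoenix', 'New York'],
    'XCord': [5, 7, 8],
    'YCord': [2, 3, 1],
})

# Use the city names as the index so only the coordinates remain as columns.
coords = df.set_index('City')[['XCord', 'YCord']]

# distance_matrix uses the Euclidean (p=2) norm by default.
result = pd.DataFrame(
    distance_matrix(coords.values, coords.values),
    index=coords.index,
    columns=coords.index,
)
print(result.round(3))
```

Rows and columns are both labelled by city, so result.loc['Boston', 'Phoenix'] looks up a single pair.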
Answered By: Andrew

Here is a method in pure Python.

Import the sqrt function from the math module:

from math import sqrt

Assume that your coordinates are stored in a dictionary named cords, keyed by city name:

cords = {}
cords['Boston'] = (5, 2)

Define a function to compute Euclidean distance of two given 2d points:

def dist(a, b):
    d = [a[0] - b[0], a[1] - b[1]]
    return sqrt(d[0] * d[0] + d[1] * d[1])

Initialize the resulting matrix as a dictionary:

D = {}

for city1, cords1 in cords.items():
    D[city1] = {}
    for city2, cords2 in cords.items():
        D[city1][city2] = dist(cords1, cords2)

D is your resulting matrix: D[city1][city2] holds the distance between city1 and city2.
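
To view the nested dictionary as a table like the one in the question, one option is to hand it to pandas. This is only a sketch: it assumes pandas is installed, which the pure-Python method itself does not need, and the name table is made up for this example.

```python
from math import sqrt

import pandas as pd

# Coordinates from the question, keyed by city name.
cords = {'Boston': (5, 2), 'Phoenix': (7, 3), 'New York': (8, 1)}

def dist(a, b):
    # Euclidean distance between two 2-D points.
    return sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

# Same nested dictionary as above, built with a comprehension.
D = {c1: {c2: dist(p1, p2) for c2, p2 in cords.items()}
     for c1, p1 in cords.items()}

# orient='index' turns the outer keys into rows and the inner keys into columns.
table = pd.DataFrame.from_dict(D, orient='index')
print(table.round(3))
```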

The full source is below along with printed result:

from math import sqrt

cords = {}
cords['Boston'] = (5, 2)
cords['Phoenix'] = (7, 3)
cords['New York'] = (8, 1)

def dist(a, b):
    d = [a[0] - b[0], a[1] - b[1]]
    return sqrt(d[0] * d[0] + d[1] * d[1]) 

D = {}

for city1, cords1 in cords.items():
    D[city1] = {}
    for city2, cords2 in cords.items():
        D[city1][city2] = dist(cords1, cords2)   

for city1, v in D.items():
    for city2, d in v.items():
        print(city1, city2, d)

Results (Python 3):

Boston Boston 0.0
Boston Phoenix 2.23606797749979
Boston New York 3.1622776601683795
Phoenix Boston 2.23606797749979
Phoenix Phoenix 0.0
Phoenix New York 2.23606797749979
New York Boston 3.1622776601683795
New York Phoenix 2.23606797749979
New York New York 0.0
Answered By: pkacprzak

SciPy also has a function for this, which computes the distance between each pair of rows taken from two input arrays:
scipy.spatial.distance.cdist()
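
A minimal sketch of how cdist might be called for this problem, passing the same coordinate array twice so that every city is compared with every other city. The array name xy and the sample coordinates (taken from the question) are assumptions for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

# One (x, y) row per city, in the order Boston, Phoenix, New York.
xy = np.array([[5, 2], [7, 3], [8, 1]])

# 'euclidean' is also cdist's default metric; it is spelled out for clarity.
dm = cdist(xy, xy, metric='euclidean')
print(np.round(dm, 3))
```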

Answered By: Maassa

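If you don't want to use scipy, you can use a list comprehension instead:

import numpy as np
xy_list = np.array([[5, 2], [7, 3], [8, 1]])  # example coordinates from the question, one row per city
dist = lambda p1, p2: np.sqrt(((p1 - p2) ** 2).sum())
dm = np.asarray([[dist(p1, p2) for p2 in xy_list] for p1 in xy_list])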
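
The same result can probably be obtained without a Python-level loop by letting NumPy broadcasting form all pairwise differences at once. This is a sketch of a swapped-in technique, not the method above; the names xy, diff and dm_vec are made up, and the intermediate array needs n × n × 2 values of memory.

```python
import numpy as np

# Example coordinates from the question, one row per city.
xy = np.array([[5, 2], [7, 3], [8, 1]], dtype=float)

# diff[i, j] holds the coordinate difference between point i and point j.
diff = xy[:, np.newaxis, :] - xy[np.newaxis, :, :]

# Sum the squared differences over the coordinate axis, then take the root.
dm_vec = np.sqrt((diff ** 2).sum(axis=-1))
print(np.round(dm_vec, 3))
```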
Answered By: francesco lc

import numpy as np
import pandas as pd

data = [[5, 7], [7, 3], [8, 1]]
ctys = ['Boston', 'Phoenix', 'New York']
df = pd.DataFrame(data, columns=['xcord', 'ycord'], index=ctys)

n_df = df.values              # coordinates as an (n, 2) NumPy array
n = n_df.shape[0]             # number of cities
matrix = np.zeros((n, n))     # square matrix to hold the pairwise distances

for i in range(n):
    for j in range(n):
        # Euclidean distance between city i and city j
        matrix[i, j] = np.sqrt(np.sum((n_df[i] - n_df[j]) ** 2))

print(matrix)
Answered By: Surya Gaur

Another option is np.hypot combined with np.subtract.outer. The timings below were measured with IPython's %%timeit and compare it with distance_matrix:

import pandas as pd
import numpy as np
from scipy.spatial import distance_matrix

data = [[5, 7], [7, 3], [8, 1]]
ctys = ['Boston', 'Phoenix', 'New York']
df = pd.DataFrame(data, columns=['xcord', 'ycord'], index=ctys)
x, y = df.xcord.to_numpy(), df.ycord.to_numpy()
x_y = df.values

%%timeit
pd.DataFrame(
    np.hypot(
        np.subtract.outer(x, x),
        np.subtract.outer(y, y)
    ),
    index=df.index, columns=df.index
)
# 32.9 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
pd.DataFrame(distance_matrix(x_y, x_y), index=df.index, columns=df.index)
# 49.8 µs ± 330 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Also, compared with a hand-written square root of summed squares, hypot is more resistant to overflow and underflow:

Underflow

i, j = 1e-200, 1e-200
np.sqrt(i**2+j**2)
# 0.0

Overflow

i, j = np.float64(1e+200), np.float64(1e+200)  # NumPy floats: plain Python floats raise OverflowError on i**2
np.sqrt(i**2+j**2)
# inf

No Underflow

i, j = 1e-200, 1e-200
np.hypot(i, j)
# 1.414213562373095e-200

No Overflow

i, j = 1e+200, 1e+200
np.hypot(i, j)
# 1.414213562373095e+200
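
One common way to get this robustness is to factor out the larger magnitude before squaring, so the squared terms never exceed 1. The sketch below illustrates that general idea only. The function scaled_hypot is hypothetical and is not claimed to be how NumPy implements hypot.

```python
import math

def scaled_hypot(a, b):
    # Hypothetical helper: computes sqrt(a**2 + b**2) by dividing both
    # values by the larger magnitude, so each squared ratio is at most 1.
    a, b = abs(a), abs(b)
    m = max(a, b)
    if m == 0.0:
        return 0.0
    return m * math.sqrt((a / m) ** 2 + (b / m) ** 2)

print(scaled_hypot(1e+200, 1e+200))  # finite, where the naive formula overflows
print(scaled_hypot(1e-200, 1e-200))  # non-zero, where the naive formula underflows
```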
Answered By: eroot163pi

This is a pure Python and numpy solution for generating a distance matrix.

Redundant computations can be skipped: since distance is symmetric, distance(a,b) is the same as distance(b,a), and there is no need to compute it twice.

data = [[5, 7], [7, 3], [8, 1]]
cities = ['Boston', 'Phoenix', 'New York']

# Euclidean distance between two points
from math import sqrt
dist = lambda a,b: sqrt((a[0]-b[0])**2+(a[1]-b[1])**2)

import numpy as np
n = len(data)
dist_matrix = np.zeros((n,n))    # initialize distance matrix to a square of zeros
for i in range(n):
    for j in range(i, n):
        dist_matrix[i,j] = dist(data[i], data[j])
        dist_matrix[j,i] = dist_matrix[i,j]       # for the symmetric part, no computation

Now dist_matrix[i,j] is the distance between cities[i] and cities[j].
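
If SciPy is acceptable, the same "compute each pair once" idea is available through pdist and squareform from scipy.spatial.distance. This is a sketch of an alternative, not part of the pure Python and numpy approach above; the names condensed and full are made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

data = [[5, 7], [7, 3], [8, 1]]
cities = ['Boston', 'Phoenix', 'New York']

# pdist returns one distance per unordered pair: (0,1), (0,2), (1,2).
condensed = pdist(np.asarray(data, dtype=float))

# squareform expands the condensed vector into a symmetric n x n matrix
# with zeros on the diagonal.
full = squareform(condensed)
print(np.round(full, 3))
```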

Answered By: user2314737