How can the Euclidean distance be calculated with NumPy?

Question

I have two points in 3D space:

a = (ax, ay, az)
b = (bx, by, bz)

I want to calculate the distance between them:

dist = sqrt((ax-bx)^2 + (ay-by)^2 + (az-bz)^2)

How do I do this with NumPy? I have:

import numpy
a = numpy.array((ax, ay, az))
b = numpy.array((bx, by, bz))

Asked By: Nathan Fellman

||

Source

Answer 1

Another instance of this problem solving method:

def dist(x,y):   
    return numpy.sqrt(numpy.sum((x-y)**2))

a = numpy.array((xa,ya,za))
b = numpy.array((xb,yb,zb))
dist_a_b = dist(a,b)

Answered By: Nathan Fellman

Answer 2

Use numpy.linalg.norm:

dist = numpy.linalg.norm(a-b)

This works because the Euclidean distance is the l2 norm, and the default value of the ord parameter in numpy.linalg.norm is 2.
For more theory, see Introduction to Data Mining:

Answered By: u0b34a0f6ae

Answer 3

A nice one-liner:

dist = numpy.linalg.norm(a-b)

However, if speed is a concern I would recommend experimenting on your machine. I’ve found that using math library’s sqrt with the ** operator for the square is much faster on my machine than the one-liner NumPy solution.

I ran my tests using this simple program:

#!/usr/bin/python
import math
import numpy
from random import uniform

def fastest_calc_dist(p1,p2):
    return math.sqrt((p2[0] - p1[0]) ** 2 +
                     (p2[1] - p1[1]) ** 2 +
                     (p2[2] - p1[2]) ** 2)

def math_calc_dist(p1,p2):
    return math.sqrt(math.pow((p2[0] - p1[0]), 2) +
                     math.pow((p2[1] - p1[1]), 2) +
                     math.pow((p2[2] - p1[2]), 2))

def numpy_calc_dist(p1,p2):
    return numpy.linalg.norm(numpy.array(p1)-numpy.array(p2))

TOTAL_LOCATIONS = 1000

p1 = dict()
p2 = dict()
for i in range(0, TOTAL_LOCATIONS):
    p1[i] = (uniform(0,1000),uniform(0,1000),uniform(0,1000))
    p2[i] = (uniform(0,1000),uniform(0,1000),uniform(0,1000))

total_dist = 0
for i in range(0, TOTAL_LOCATIONS):
    for j in range(0, TOTAL_LOCATIONS):
        dist = fastest_calc_dist(p1[i], p2[j]) #change this line for testing
        total_dist += dist

print total_dist

On my machine, math_calc_dist runs much faster than numpy_calc_dist: 1.5 seconds versus 23.5 seconds.

To get a measurable difference between fastest_calc_dist and math_calc_dist I had to up TOTAL_LOCATIONS to 6000. Then fastest_calc_dist takes ~50 seconds while math_calc_dist takes ~60 seconds.

You can also experiment with numpy.sqrt and numpy.square though both were slower than the math alternatives on my machine.

My tests were run with Python 2.6.6.

Answered By: user118662

Answer 4

You can just subtract the vectors and then innerproduct.

Following your example,

a = numpy.array((xa, ya, za))
b = numpy.array((xb, yb, zb))

tmp = a - b
sum_squared = numpy.dot(tmp.T, tmp)
result = numpy.sqrt(sum_squared)

Answered By: PuercoPop

Answer 5

It can be done like the following. I don’t know how fast it is, but it’s not using NumPy.

from math import sqrt
a = (1, 2, 3) # Data point 1
b = (4, 5, 6) # Data point 2
print sqrt(sum( (a - b)**2 for a, b in zip(a, b)))

Answered By: The Demz

Answer 6

I find a ‘dist’ function in matplotlib.mlab, but I don’t think it’s handy enough.

I’m posting it here just for reference.

import numpy as np
import matplotlib as plt

a = np.array([1, 2, 3])
b = np.array([2, 3, 4])

# Distance between a and b
dis = plt.mlab.dist(a, b)

Answered By: Alan Wang

Answer 7

Use scipy.spatial.distance.euclidean:

from scipy.spatial import distance
a = (1, 2, 3)
b = (4, 5, 6)
dst = distance.euclidean(a, b)

Answered By: Avision

Answer 8

Here’s some concise code for Euclidean distance in Python given two points represented as lists in Python.

def distance(v1,v2): 
    return sum([(x-y)**2 for (x,y) in zip(v1,v2)])**(0.5)

Answered By: Andy Lee

Answer 9

I like np.dot (dot product):

a = numpy.array((xa,ya,za))
b = numpy.array((xb,yb,zb))

distance = (np.dot(a-b,a-b))**.5

Answered By: travelingbones

Answer 10

Having a and b as you defined them, you can use also:

distance = np.sqrt(np.sum((a-b)**2))

Answered By: Alejandro Sazo

Answer 11

Calculate the Euclidean distance for multidimensional space:

 import math

 x = [1, 2, 6] 
 y = [-2, 3, 2]

 dist = math.sqrt(sum([(xi-yi)**2 for xi,yi in zip(x, y)]))
 5.0990195135927845

Answered By: Gennady Nikitin

Answer 12

I want to expound on the simple answer with various performance notes. np.linalg.norm will do perhaps more than you need:

dist = numpy.linalg.norm(a-b)

Firstly – this function is designed to work over a list and return all of the values, e.g. to compare the distance from pA to the set of points sP:

sP = set(points)
pA = point
distances = np.linalg.norm(sP - pA, ord=2, axis=1.)  # 'distances' is a list

Remember several things:

Python function calls are expensive.
[Regular] Python doesn’t cache name lookups.

So

def distance(pointA, pointB):
    dist = np.linalg.norm(pointA - pointB)
    return dist

isn’t as innocent as it looks.

>>> dis.dis(distance)
  2           0 LOAD_GLOBAL              0 (np)
              2 LOAD_ATTR                1 (linalg)
              4 LOAD_ATTR                2 (norm)
              6 LOAD_FAST                0 (pointA)
              8 LOAD_FAST                1 (pointB)
             10 BINARY_SUBTRACT
             12 CALL_FUNCTION            1
             14 STORE_FAST               2 (dist)

  3          16 LOAD_FAST                2 (dist)
             18 RETURN_VALUE

Firstly – every time we call it, we have to do a global lookup for "np", a scoped lookup for "linalg" and a scoped lookup for "norm", and the overhead of merely calling the function can equate to dozens of python instructions.

Lastly, we wasted two operations on to store the result and reload it for return…

First pass at improvement: make the lookup faster, skip the store

def distance(pointA, pointB, _norm=np.linalg.norm):
    return _norm(pointA - pointB)

We get the far more streamlined:

>>> dis.dis(distance)
  2           0 LOAD_FAST                2 (_norm)
              2 LOAD_FAST                0 (pointA)
              4 LOAD_FAST                1 (pointB)
              6 BINARY_SUBTRACT
              8 CALL_FUNCTION            1
             10 RETURN_VALUE

The function call overhead still amounts to some work, though. And you’ll want to do benchmarks to determine whether you might be better doing the math yourself:

def distance(pointA, pointB):
    return (
        ((pointA.x - pointB.x) ** 2) +
        ((pointA.y - pointB.y) ** 2) +
        ((pointA.z - pointB.z) ** 2)
    ) ** 0.5  # fast sqrt

On some platforms, **0.5 is faster than math.sqrt. Your mileage may vary.

**** Advanced performance notes.

Why are you calculating distance? If the sole purpose is to display it,

 print("The target is %.2fm away" % (distance(a, b)))

move along. But if you’re comparing distances, doing range checks, etc., I’d like to add some useful performance observations.

Let’s take two cases: sorting by distance or culling a list to items that meet a range constraint.

# Ultra naive implementations. Hold onto your hat.

def sort_things_by_distance(origin, things):
    return things.sort(key=lambda thing: distance(origin, thing))

def in_range(origin, range, things):
    things_in_range = []
    for thing in things:
        if distance(origin, thing) <= range:
            things_in_range.append(thing)

The first thing we need to remember is that we are using Pythagoras to calculate the distance (dist = sqrt(x^2 + y^2 + z^2)) so we’re making a lot of sqrt calls. Math 101:

dist = root ( x^2 + y^2 + z^2 )
:.
dist^2 = x^2 + y^2 + z^2
and
sq(N) < sq(M) iff M > N
and
sq(N) > sq(M) iff N > M
and
sq(N) = sq(M) iff N == M

In short: until we actually require the distance in a unit of X rather than X^2, we can eliminate the hardest part of the calculations.

# Still naive, but much faster.

def distance_sq(left, right):
    """ Returns the square of the distance between left and right. """
    return (
        ((left.x - right.x) ** 2) +
        ((left.y - right.y) ** 2) +
        ((left.z - right.z) ** 2)
    )

def sort_things_by_distance(origin, things):
    return things.sort(key=lambda thing: distance_sq(origin, thing))

def in_range(origin, range, things):
    things_in_range = []

    # Remember that sqrt(N)**2 == N, so if we square
    # range, we don't need to root the distances.
    range_sq = range**2

    for thing in things:
        if distance_sq(origin, thing) <= range_sq:
            things_in_range.append(thing)

Great, both functions no-longer do any expensive square roots. That’ll be much faster, but before you go further, check yourself: why did sort_things_by_distance need a "naive" disclaimer both times above? Answer at the very bottom (*a1).

We can improve in_range by converting it to a generator:

def in_range(origin, range, things):
    range_sq = range**2
    yield from (thing for thing in things
                if distance_sq(origin, thing) <= range_sq)

This especially has benefits if you are doing something like:

if any(in_range(origin, max_dist, things)):
    ...

But if the very next thing you are going to do requires a distance,

for nearby in in_range(origin, walking_distance, hotdog_stands):
    print("%s %.2fm" % (nearby.name, distance(origin, nearby)))

consider yielding tuples:

def in_range_with_dist_sq(origin, range, things):
    range_sq = range**2
    for thing in things:
        dist_sq = distance_sq(origin, thing)
        if dist_sq <= range_sq: yield (thing, dist_sq)

This can be especially useful if you might chain range checks (‘find things that are near X and within Nm of Y’, since you don’t have to calculate the distance again).

But what about if we’re searching a really large list of things and we anticipate a lot of them not being worth consideration?

There is actually a very simple optimization:

def in_range_all_the_things(origin, range, things):
    range_sq = range**2
    for thing in things:
        dist_sq = (origin.x - thing.x) ** 2
        if dist_sq <= range_sq:
            dist_sq += (origin.y - thing.y) ** 2
            if dist_sq <= range_sq:
                dist_sq += (origin.z - thing.z) ** 2
                if dist_sq <= range_sq:
                    yield thing

Whether this is useful will depend on the size of ‘things’.

def in_range_all_the_things(origin, range, things):
    range_sq = range**2
    if len(things) >= 4096:
        for thing in things:
            dist_sq = (origin.x - thing.x) ** 2
            if dist_sq <= range_sq:
                dist_sq += (origin.y - thing.y) ** 2
                if dist_sq <= range_sq:
                    dist_sq += (origin.z - thing.z) ** 2
                    if dist_sq <= range_sq:
                        yield thing
    elif len(things) > 32:
        for things in things:
            dist_sq = (origin.x - thing.x) ** 2
            if dist_sq <= range_sq:
                dist_sq += (origin.y - thing.y) ** 2 + (origin.z - thing.z) ** 2
                if dist_sq <= range_sq:
                    yield thing
    else:
        ... just calculate distance and range-check it ...

And again, consider yielding the dist_sq. Our hotdog example then becomes:

# Chaining generators
info = in_range_with_dist_sq(origin, walking_distance, hotdog_stands)
info = (stand, dist_sq**0.5 for stand, dist_sq in info)
for stand, dist in info:
    print("%s %.2fm" % (stand, dist))

(*a1: sort_things_by_distance’s sort key calls distance_sq for every single item, and that innocent looking key is a lambda, which is a second function that has to be invoked…)

Answered By: kfsone

Answer 13

For anyone interested in computing multiple distances at once, I’ve done a little comparison using perfplot (a small project of mine).

The first advice is to organize your data such that the arrays have dimension (3, n) (and are C-contiguous obviously). If adding happens in the contiguous first dimension, things are faster, and it doesn’t matter too much if you use sqrt-sum with axis=0, linalg.norm with axis=0, or

a_min_b = a - b
numpy.sqrt(numpy.einsum('ij,ij->j', a_min_b, a_min_b))

which is, by a slight margin, the fastest variant. (That actually holds true for just one row as well.)

The variants where you sum up over the second axis, axis=1, are all substantially slower.

Code to reproduce the plot:

import numpy
import perfplot
from scipy.spatial import distance


def linalg_norm(data):
    a, b = data[0]
    return numpy.linalg.norm(a - b, axis=1)


def linalg_norm_T(data):
    a, b = data[1]
    return numpy.linalg.norm(a - b, axis=0)


def sqrt_sum(data):
    a, b = data[0]
    return numpy.sqrt(numpy.sum((a - b) ** 2, axis=1))


def sqrt_sum_T(data):
    a, b = data[1]
    return numpy.sqrt(numpy.sum((a - b) ** 2, axis=0))


def scipy_distance(data):
    a, b = data[0]
    return list(map(distance.euclidean, a, b))


def sqrt_einsum(data):
    a, b = data[0]
    a_min_b = a - b
    return numpy.sqrt(numpy.einsum("ij,ij->i", a_min_b, a_min_b))


def sqrt_einsum_T(data):
    a, b = data[1]
    a_min_b = a - b
    return numpy.sqrt(numpy.einsum("ij,ij->j", a_min_b, a_min_b))


def setup(n):
    a = numpy.random.rand(n, 3)
    b = numpy.random.rand(n, 3)
    out0 = numpy.array([a, b])
    out1 = numpy.array([a.T, b.T])
    return out0, out1


b = perfplot.bench(
    setup=setup,
    n_range=[2 ** k for k in range(22)],
    kernels=[
        linalg_norm,
        linalg_norm_T,
        scipy_distance,
        sqrt_sum,
        sqrt_sum_T,
        sqrt_einsum,
        sqrt_einsum_T,
    ],
    xlabel="len(x), len(y)",
)
b.save("norm.png")

Answered By: Nico Schlömer

Answer 14

import numpy as np
from scipy.spatial import distance
input_arr = np.array([[0,3,0],[2,0,0],[0,1,3],[0,1,2],[-1,0,1],[1,1,1]]) 
test_case = np.array([0,0,0])
dst=[]
for i in range(0,6):
    temp = distance.euclidean(test_case,input_arr[i])
    dst.append(temp)
print(dst)

Answered By: Ankur Nadda

Answer 15

import math

dist = math.hypot(math.hypot(xa-xb, ya-yb), za-zb)

Answered By: Jonas De Schouwer

Answer 16

You can easily use the formula

distance = np.sqrt(np.sum(np.square(a-b)))

which does actually nothing more than using Pythagoras’ theorem to calculate the distance, by adding the squares of Δx, Δy and Δz and rooting the result.

Answered By: Jonas De Schouwer

Answer 17

Find difference of two matrices first. Then, apply element wise multiplication with numpy’s multiply command. After then, find summation of the element wise multiplied new matrix. Finally, find square root of the summation.

def findEuclideanDistance(a, b):
    euclidean_distance = a - b
    euclidean_distance = np.sum(np.multiply(euclidean_distance, euclidean_distance))
    euclidean_distance = np.sqrt(euclidean_distance)
    return euclidean_distance

Answered By: johncasey

Answer 18

Starting Python 3.8, the math module directly provides the dist function, which returns the euclidean distance between two points (given as tuples or lists of coordinates):

from math import dist

dist((1, 2, 6), (-2, 3, 2)) # 5.0990195135927845

And if you’re working with lists:

dist([1, 2, 6], [-2, 3, 2]) # 5.0990195135927845

Answered By: Xavier Guihot

Answer 19

Since Python 3.8

Since Python 3.8 the math module includes the function math.dist().
See here https://docs.python.org/3.8/library/math.html#math.dist.

math.dist(p1, p2)
Return the Euclidean distance between two points p1 and p2,
each given as a sequence (or iterable) of coordinates.

import math
print( math.dist( (0,0),   (1,1)   )) # sqrt(2) -> 1.4142
print( math.dist( (0,0,0), (1,1,1) )) # sqrt(3) -> 1.7321

Answered By: ePi272314

Answer 20

With Python 3.8, it’s very easy.

https://docs.python.org/3/library/math.html#math.dist

math.dist(p, q)

Return the Euclidean distance between two points p and q, each given
as a sequence (or iterable) of coordinates. The two points must have
the same dimension.

Roughly equivalent to:

sqrt(sum((px - qx) ** 2.0 for px, qx in zip(p, q)))

Answered By: hakki

Answer 21

import numpy as np
# any two python array as two points
a = [0, 0]
b = [3, 4]

You first change list to numpy array and do like this: print(np.linalg.norm(np.array(a) - np.array(b))). Second method directly from python list as: print(np.linalg.norm(np.subtract(a,b)))

Answered By: Uddhav P. Gautam

Answer 22

The other answers work for floating point numbers, but do not correctly compute the distance for integer dtypes which are subject to overflow and underflow. Note that even scipy.distance.euclidean has this issue:

>>> a1 = np.array([1], dtype='uint8')
>>> a2 = np.array([2], dtype='uint8')
>>> a1 - a2
array([255], dtype=uint8)
>>> np.linalg.norm(a1 - a2)
255.0
>>> from scipy.spatial import distance
>>> distance.euclidean(a1, a2)
255.0

This is common, since many image libraries represent an image as an ndarray with dtype="uint8". This means that if you have a greyscale image which consists of very dark grey pixels (say all the pixels have color #000001) and you’re diffing it against black image (#000000), you can end up with x-y consisting of 255 in all cells, which registers as the two images being very far apart from each other. For unsigned integer types (e.g. uint8), you can safely compute the distance in numpy as:

np.linalg.norm(np.maximum(x, y) - np.minimum(x, y))

For signed integer types, you can cast to a float first:

np.linalg.norm(x.astype("float") - y.astype("float"))

For image data specifically, you can use opencv’s norm method:

import cv2
cv2.norm(x, y, cv2.NORM_L2)

Answered By: Lucas Wiman

Answer 23

What’s the best way to do this with NumPy, or with Python in general? I have:

Well best way would be safest and also the fastest

I would suggest hypot usage for reliable results for chances of underflow and overflow are very little compared to writing own sqroot calculator

Lets see math.hypot, np.hypot vs vanilla np.sqrt(np.sum((np.array([i, j, k])) ** 2, axis=1))

i, j, k = 1e+200, 1e+200, 1e+200
math.hypot(i, j, k)
# 1.7320508075688773e+200

np.sqrt(np.sum((np.array([i, j, k])) ** 2))
# RuntimeWarning: overflow encountered in square

Speed wise math.hypot look better

%%timeit
math.hypot(i, j, k)
# 100 ns ± 1.05 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

%%timeit
np.sqrt(np.sum((np.array([i, j, k])) ** 2))
# 6.41 µs ± 33.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Underflow

i, j = 1e-200, 1e-200
np.sqrt(i**2+j**2)
# 0.0

Overflow

i, j = 1e+200, 1e+200
np.sqrt(i**2+j**2)
# inf

No Underflow

i, j = 1e-200, 1e-200
np.hypot(i, j)
# 1.414213562373095e-200

No Overflow

i, j = 1e+200, 1e+200
np.hypot(i, j)
# 1.414213562373095e+200

Refer

Answered By: eroot163pi

Answer 24

The fastest solution I could come up with for large number of distances is using numexpr. On my machine it is faster than using numpy einsum:

import numexpr as ne
import numpy as np

a_min_b=a-b
np.sqrt(ne.evaluate("sum((a_min_b)**2,axis=1)"))

Answered By: Okapi575

Answer 25

If you want something more explicit you can easily write the formula like this:

np.sqrt(np.sum((a-b)**2))

Even with arrays of 10_000_000 elements this still runs at 0.1s on my machine.

Answered By: WireInTheGhost

Answer 26

1. SciPy’s vectorized `cdist()` for Euclidean distance matrix

@Nico Schlömer‘s benchmarks show scipy’s euclidean() function to be much slower than its numpy counterparts. The reason is that it’s meant to work on a pair of points, not an array of points; thus not vectorized. Also, his benchmark uses code to find the Euclidean distances between arrays of equal length.

If you need to compute the Euclidean distance matrix between each pair of points from two collections of inputs, then there is another SciPy function, cdist(), that is much faster than numpy.

Consider the following example where a contains 3 points and b contains 2 points. SciPy’s cdist() computes the Euclidean distances between every point in a to every point in b, so in this example, it would return a 3×2 matrix.

import numpy as np
from scipy.spatial import distance

a = [(1, 2, 3), (3, 4, 5), (2, 3, 6)]
b = [(1, 2, 3), (4, 5, 6)]

dsts1 = distance.cdist(a, b)

# array([[0.        , 5.19615242],
#        [3.46410162, 1.73205081],
#        [3.31662479, 2.82842712]])

It is especially useful if we have a collection of points and we want to find the closest distance to each point other than itself; a common use-case is in natural language processing. For example, to compute the Euclidean distances between every pair of points in a collection, distance.cdist(a, a) does the job. Since the distance from a point to itself is 0, the diagonals of this matrix will be all zero.

The same task can be performed with numpy-only methods using broadcasting. We simply need to add another dimension to one of the arrays.

# using `linalg.norm`
dsts2 = np.linalg.norm(np.array(a)[:, None] - b, axis=-1)

# using a `sqrt` + `sum` + `square`
dsts3 = np.sqrt(np.sum((np.array(a)[:, None] - b)**2, axis=-1))

# equality check
np.allclose(dsts1, dsts2) and np.allclose(dsts1, dsts3)        # True

As mentioned earlier, cdist() is much faster than the numpy counterparts. The following perfplot shows as much.¹

2. Scikit-learn’s `euclidean_distances()`

Scikit-learn is a pretty big library so unless you’re not using it for something else, it doesn’t make much sense to import it only for Euclidean distance computation but for completeness, it also has euclidean_distances(), paired_distances() and pairwise_distances() methods that can be used to compute Euclidean distances. It has other convenient pairwise distance computation methods worth checking out.

One useful thing about scikit-learn’s methods is that it can handle sparse matrices as is, whereas scipy/numpy will need to have sparse matrices converted into arrays to perform computation so depending on the size of the data, scikit-learn’s methods may be the only function that runs.

An example:

from scipy import sparse
from sklearn.metrics import pairwise

A = sparse.random(1_000_000, 3)
b = [(1, 2, 3), (4, 5, 6)]

dsts = pairwise.euclidean_distances(A, b)

¹ The code used to produce the perfplot:

import numpy as np
from scipy.spatial import distance
import perfplot
import matplotlib.pyplot as plt

def sqrt_sum(arr):
    return np.sqrt(np.sum((arr[:, None] - arr) ** 2, axis=-1))

def linalg_norm(arr):
    return np.linalg.norm(arr[:, None] - arr, axis=-1)

def scipy_cdist(arr):
    return distance.cdist(arr, arr)

perfplot.plot(
    setup=lambda n: np.random.rand(n, 3),
    n_range=[2 ** k for k in range(14)],
    kernels=[sqrt_sum, scipy_cdist, linalg_norm],
    title="Euclidean distance between arrays of 3-D points",
    xlabel="len(x), len(y)",
    equality_check=np.allclose
);

Answered By: cottontail

How can the Euclidean distance be calculated with NumPy?

Question:

Answers:

Since Python 3.8

Speed wise math.hypot look better

Underflow

Overflow

No Underflow

No Overflow

1. SciPy’s vectorized `cdist()` for Euclidean distance matrix

2. Scikit-learn’s `euclidean_distances()`

How can the Euclidean distance be calculated with NumPy?

Question:

Answers:

Since Python 3.8

Speed wise math.hypot look better

Underflow

Overflow

No Underflow

No Overflow

1. SciPy’s vectorized cdist() for Euclidean distance matrix

2. Scikit-learn’s euclidean_distances()

1. SciPy’s vectorized `cdist()` for Euclidean distance matrix

2. Scikit-learn’s `euclidean_distances()`