Finding index of nearest point in numpy arrays of x and y coordinates

Question:

I have two 2d numpy arrays: x_array contains positional information in the x-direction, y_array contains positions in the y-direction.

I then have a long list of x,y points.

For each point in the list, I need to find the array index of the location (specified in the arrays) which is closest to that point.

I have naively produced some code which works, based on this question:
Find nearest value in numpy array

i.e.

import time
import numpy

def find_index_of_nearest_xy(y_array, x_array, y_point, x_point):
    distance = (y_array-y_point)**2 + (x_array-x_point)**2
    idy,idx = numpy.where(distance==distance.min())
    return idy[0],idx[0]

def do_all(y_array, x_array, points):
    store = []
    for i in range(points.shape[1]):
        store.append(find_index_of_nearest_xy(y_array,x_array,points[0,i],points[1,i]))
    return store


# Create some dummy data
y_array = numpy.random.random(10000).reshape(100,100)
x_array = numpy.random.random(10000).reshape(100,100)

points = numpy.random.random(10000).reshape(2,5000)

# Time how long it takes to run
start = time.time()
results = do_all(y_array, x_array, points)
end = time.time()
print('Completed in:', end - start)

I’m doing this over a large dataset and would really like to speed it up a bit.
Can anyone optimize this?

Thanks.


UPDATE: SOLUTION following suggestions by @silvado and @justin (below)

import scipy.spatial

# Shoe-horn existing data for entry into KDTree routines
combined_x_y_arrays = numpy.dstack([y_array.ravel(), x_array.ravel()])[0]
points_list = list(points.transpose())


def do_kdtree(combined_x_y_arrays,points):
    mytree = scipy.spatial.cKDTree(combined_x_y_arrays)
    dist, indexes = mytree.query(points)
    return indexes

start = time.time()
results2 = do_kdtree(combined_x_y_arrays,points_list)
end = time.time()
print('Completed in:', end - start)

The code above sped up my search (for 5000 points in the 100×100 arrays) by a factor of about 100. Interestingly, using scipy.spatial.KDTree (instead of scipy.spatial.cKDTree) gave timings comparable to my naive solution, so it is definitely worth using the cKDTree version…
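
A quick way to reproduce that comparison on your own data is to time both tree classes on the same inputs; a minimal sketch reusing the names above (the timing harness itself is only illustrative):

import time
import numpy
import scipy.spatial

combined_x_y_arrays = numpy.dstack([y_array.ravel(), x_array.ravel()])[0]
query_points = points.transpose()

# Time the pure-Python tree against the C implementation on the same data.
for TreeClass in (scipy.spatial.KDTree, scipy.spatial.cKDTree):
    start = time.time()
    dist, indexes = TreeClass(combined_x_y_arrays).query(query_points)
    print(TreeClass.__name__, 'completed in:', time.time() - start)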

Asked By: Pete W


Answers:

If you can massage your data into the right format, a fast way to go is to use the methods in scipy.spatial.distance:

http://docs.scipy.org/doc/scipy/reference/spatial.distance.html

In particular, pdist and cdist provide fast ways to calculate pairwise distances.
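
For example, a minimal sketch with cdist, assuming the same array layout as in the question (this builds the full distance matrix in memory, so it is best suited to moderate problem sizes):

import numpy
from scipy.spatial.distance import cdist

# (N, 2) candidate locations and (M, 2) query points, matching the question's layout.
grid_points = numpy.column_stack([y_array.ravel(), x_array.ravel()])
query_points = points.transpose()

# cdist returns an (M, N) matrix of pairwise distances; argmin along axis 1
# picks the nearest grid point for every query point.
flat_indexes = cdist(query_points, grid_points).argmin(axis=1)

# Convert the flat indexes back into (row, column) indexes of the 2D arrays.
rows, cols = numpy.unravel_index(flat_indexes, x_array.shape)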

Answered By: JoshAdel

scipy.spatial also has a k-d tree implementation: scipy.spatial.KDTree.

The approach is generally to first use the point data to build up a k-d tree. The computational complexity of that is on the order of N log N, where N is the number of data points. Range queries and nearest neighbour searches can then be done with log N complexity. This is much more efficient than simply cycling through all points (complexity N).

Thus, if you have repeated range or nearest neighbor queries, a k-d tree is highly recommended.
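
A minimal sketch of that pattern, reusing the combined (N, 2) coordinate array from the question's update (the cKDTree variant is used here since the update found it much faster): build the tree once, then reuse it for as many query batches as needed.

import numpy
import scipy.spatial

# Build the search structure once: roughly O(N log N).
tree = scipy.spatial.cKDTree(combined_x_y_arrays)

# Each nearest-neighbour lookup is then roughly O(log N) per query point,
# so repeated query batches reuse the same tree instead of rebuilding it.
dist, indexes = tree.query(points.transpose())
more_dist, more_indexes = tree.query(numpy.random.random((1000, 2)))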

Answered By: silvado

Here is a scipy.spatial.KDTree example:

In [1]: from scipy import spatial

In [2]: import numpy as np

In [3]: A = np.random.random((10,2))*100

In [4]: A
Out[4]:
array([[ 68.83402637,  38.07632221],
       [ 76.84704074,  24.9395109 ],
       [ 16.26715795,  98.52763827],
       [ 70.99411985,  67.31740151],
       [ 71.72452181,  24.13516764],
       [ 17.22707611,  20.65425362],
       [ 43.85122458,  21.50624882],
       [ 76.71987125,  44.95031274],
       [ 63.77341073,  78.87417774],
       [  8.45828909,  30.18426696]])

In [5]: pt = [6, 30]  # <-- the point to find

In [6]: A[spatial.KDTree(A).query(pt)[1]] # <-- the nearest point 
Out[6]: array([  8.45828909,  30.18426696])

# how it works
In [7]: distance,index = spatial.KDTree(A).query(pt)

In [8]: distance # <-- The distance to the nearest neighbor
Out[8]: 2.4651855048258393

In [9]: index # <-- The index of that neighbor in A
Out[9]: 9

# then
In [10]: A[index]
Out[10]: array([  8.45828909,  30.18426696])
Answered By: efirvida

Search methods have two phases:

  1. build a search structure, e.g. a KDTree, from the npt data points (your x y)
  2. lookup nq query points.

Different methods have different build times and different query times. Your choice will depend a lot on npt and nq: scipy cdist has a build time of 0, but a query time of ~ npt * nq; KDTree build times are more complicated, but lookups are very fast, ~ ln(npt) * nq.

On a regular (Manhattan) grid you can do much better: see (ahem) find-nearest-value-in-numpy-array, and the sketch below.
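
For instance, if the grid is regular with monotonically increasing axis values, something like the following finds the nearest index per axis with searchsorted and no tree at all (x_axis and y_axis are illustrative 1-D arrays, not variables from the question):

import numpy

def nearest_index_on_axis(axis_values, query):
    # searchsorted gives the insertion point into the sorted axis; compare
    # the two neighbouring grid values to decide which one is closer.
    i = numpy.clip(numpy.searchsorted(axis_values, query), 1, len(axis_values) - 1)
    left, right = axis_values[i - 1], axis_values[i]
    return numpy.where(query - left < right - query, i - 1, i)

x_axis = numpy.linspace(0.0, 1.0, 100)   # illustrative regular grid axes
y_axis = numpy.linspace(0.0, 1.0, 100)
ix = nearest_index_on_axis(x_axis, numpy.array([0.123, 0.987]))
iy = nearest_index_on_axis(y_axis, numpy.array([0.456, 0.001]))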

A little testbench: building a KDTree of 5000 × 5000 2d points takes about 30 seconds, and queries then take microseconds; scipy cdist on 25 million × 20 points (all pairs, 4G) takes about 5 seconds, on my old iMac.
Answered By: denis

I have been trying to follow along with this, and although I am new to Jupyter Notebooks, Python and the various tools discussed here, I have managed to get some way down the road I'm travelling.

import pandas as pd
import scipy.spatial

BURoute = pd.read_csv('C:/Users/andre/BUKP_1m.csv', header=None)
NGEPRoute = pd.read_csv('c:/Users/andre/N1-06.csv', header=None)

I create a combined XY array from my BURoute dataframe

combined_x_y_arrays = BURoute.iloc[:,[0,1]]

And I create the points with the following command

points = NGEPRoute.iloc[:,[0,1]]

I then do the KDTree magic

def do_kdtree(combined_x_y_arrays, points): 
    mytree = scipy.spatial.cKDTree(combined_x_y_arrays)
    dist, indexes = mytree.query(points)
    return indexes

results2 = do_kdtree(combined_x_y_arrays, points)

This gives me an array of the indexes. I’m now trying to figure out how to calculate the distance between the points and the indexed points in the results array.
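
Note that mytree.query above already computes those distances, so a minimal tweak (same names as above, just returning both values) exposes them:

import scipy.spatial

def do_kdtree(combined_x_y_arrays, points):
    mytree = scipy.spatial.cKDTree(combined_x_y_arrays)
    # query returns the distance to, and the index of, the nearest
    # neighbour for every query point.
    dist, indexes = mytree.query(points)
    return dist, indexes

distances, results2 = do_kdtree(combined_x_y_arrays, points)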

Answered By: Andrew Lindsay