How to use geopanda or shapely to find nearest point in same geodataframe

Question:

I have a geodataframe showing ~25 locations represented as point geometry. I am trying to come up with a script that goes through each point, identifies the nearest location and returns the name of the nearest location and the distance.

I can easily do this if I have different geodataframes using nearest_points(geom1, geom2) in the shapely.ops library. However all my locations are stored in one geodataframe. I am trying to loop through and that is where I am having trouble

here is my sample file:

geofile = gpd.GeoDataFrame([[0, 'location A', Point(55, 55)],
                            [1, 'location B', Point(66, 66)],
                            [2, 'Location C', Point(99, 99)],
                            [3, 'Location D', Point(11, 11)]],
                           columns=['ID','Location','geometry'])

Here is the loop I am creating to no avail.

for index, row in geofile.iterrows():
    nearest_geoms=nearest_points(row, geofile)
    print('location:' + nearest_geoms[0])
    print('nearest:' + nearest_geoms[1])
    print('-------')

I am getting this error:

AttributeError: 'Series' object has no attribute '_geom'

However I think my problem is beyond the error cause somehow I have to exclude the row I am looping through cause that will automatically return as the closest location since it is that location.

My end result for one location would be the following:

([0,'location A','location B', '5 miles', Point(55,55)], columns=['ID','Location','Nearest', 'Distance',geometry'])
Asked By: rzt101

||

Answers:

Shapely’s nearest_points function compares shapely geometries. To compare a single Point geometry against multiple other Point geometries, you can use .unary_union to compare against the resulting MultiPoint geometry. And yes, at each row operation, drop the respective point so it is not compared against itself.

import geopandas as gpd
from shapely.geometry import Point
from shapely.ops import nearest_points

df = gpd.GeoDataFrame([[0, 'location A', Point(55,55)], 
                       [1, 'location B', Point(66,66)],
                       [2, 'Location C', Point(99,99)],
                       [3, 'Location D' ,Point(11,11)]], 
                      columns=['ID','Location','geometry'])
df.insert(3, 'nearest_geometry', None)

for index, row in df.iterrows():
    point = row.geometry
    multipoint = df.drop(index, axis=0).geometry.unary_union
    queried_geom, nearest_geom = nearest_points(point, multipoint)
    df.loc[index, 'nearest_geometry'] = nearest_geom

Resulting in

    ID  Location    geometry        nearest_geometry
0   0   location A  POINT (55 55)   POINT (66 66)
1   1   location B  POINT (66 66)   POINT (55 55)
2   2   Location C  POINT (99 99)   POINT (66 66)
3   3   Location D  POINT (11 11)   POINT (55 55)

enter image description here

Answered By: Christoph Rieke

The following approach using sklearn.neighbors.NearestNeighbors accomplishes this task with two lines of code and scales rather well (both in terms of the number of points and the number of neighbors):

import numpy as np
import pandas as pd
import geopandas as gpd
from sklearn.neighbors import NearestNeighbors

N_POINTS = 10_000
N_NEIGHBORS = 10

# generate larger dataframe with random points:
np.random.seed(23)
acoords = np.random.randint(0, 1000, (N_POINTS, 2))
df = gpd.GeoDataFrame({"ID": range(N_POINTS)}, geometry=gpd.points_from_xy(acoords[:, 0], acoords[:, 1]))

# 2d numpy array of the coordinates
coords = np.array(df.geometry.map(lambda p: [p.x, p.y]).tolist())

# "train"/initialize the NearestNeighbors model 
# NOTE: N_NEIGHBORS + 1 since we are dropping the nearest point 
#       (which is each point itself with distance 0)
knn = NearestNeighbors(n_neighbors=N_NEIGHBORS + 1, algorithm='kd_tree').fit(coords)
# retrieve neighbors (distance and index)
knn_dist, knn_idx = knn.kneighbors(coords)

# add results to dataframe:
df[list(map("NEIGHBOR_{}".format, range(1, N_NEIGHBORS + 1)))] = 
        df.geometry.values.to_numpy()[knn_idx[:, 1:]]

print(df)

Result:

        ID                 geometry       NEIGHBOR_1  ...       NEIGHBOR_8  
0        0  POINT (595.000 742.000)  POINT (597 737)  ...  POINT (592 756)   
1        1   POINT (40.000 969.000)   POINT (40 971)  ...   POINT (27 961)   
...    ...                      ...              ...  ...              ...   
9998  9998   POINT (38.000 508.000)   POINT (34 507)  ...   POINT (50 516)   
9999  9999  POINT (891.000 936.000)  POINT (887 931)  ...  POINT (876 929)   

           NEIGHBOR_9      NEIGHBOR_10  
0     POINT (598 727)  POINT (606 730)  
1      POINT (31 954)   POINT (37 987)  
...               ...              ...  
9998   POINT (29 496)   POINT (23 511)  
9999  POINT (908 930)  POINT (901 951)  

[10000 rows x 12 columns]

Older/obsolete answers:

Here is another approach based on scipy.spatial.distance.cdist. The iterrows is avoided by using numpy masked arrays.

import geopandas as gpd
from scipy.spatial import distance
import numpy.ma as ma
from shapely.geometry import Point
import numpy as np

df = gpd.GeoDataFrame([[0, 'location A', Point(55,55)], 
                       [1, 'location B', Point(66,66)],
                       [2, 'Location C', Point(99,99)],
                       [3, 'Location D' ,Point(11,11)]], 
                      columns=['ID','Location','geometry'])

coords = np.stack(df.geometry.apply(lambda x: [x.x, x.y]))
distance_matrix = ma.masked_where((dist := distance.cdist(*[coords] * 2)) == 0, dist)
df["closest_ID"] = np.argmin(distance_matrix, axis=0)
df = df.join(df.set_index("ID").geometry.rename("nearest_geometry"), on="closest_ID")
df.drop("closest_ID", axis=1)

# Out:
   ID    Location               geometry           nearest_geometry
0   0  location A  POINT (55.000 55.000)  POINT (66.00000 66.00000)
1   1  location B  POINT (66.000 66.000)  POINT (55.00000 55.00000)
2   2  Location C  POINT (99.000 99.000)  POINT (66.00000 66.00000)
3   3  Location D  POINT (11.000 11.000)  POINT (55.00000 55.00000)

Generalization for multiple neighbors

Since the distance_matrix contains the complete information on distances between all pairs of points, it is easy to generalize this approach to arbitrary numbers of neighbors. For example, if we are interested in finding the N_NEAREST = 2 neighbors for each point, we can sort the distance matrix (using np.argsort, instead of picking the np.argmin, as before) and select the corresponding number of columns:

nearest_id_cols = list(map("nearest_id_{}".format, range(1, N_NEAREST + 1)))
nearest_geom_cols = list(map("nearest_geometry_{}".format, range(1, N_NEAREST + 1)))
df[nearest_id_cols] = np.argsort(distance_matrix, axis=1)[:, :N_NEAREST]
df[nearest_geom_cols] = df[nearest_id_cols].applymap(
                             lambda x: df.set_index("ID").geometry[x])

# out:
   ID    Location                  geometry  nearest_id_1  nearest_id_2  
0   0  location A  POINT (55.00000 55.00000)             1             2   
1   1  location B  POINT (66.00000 66.00000)             0             2   
2   2  Location C  POINT (99.00000 99.00000)             1             0   
3   3  Location D  POINT (11.00000 11.00000)             0             1   

  nearest_geometry_1 nearest_geometry_2  
0       POINT (66 66)       POINT (99 99)  
1       POINT (55 55)       POINT (99 99)  
2       POINT (66 66)       POINT (55 55)  
3       POINT (55 55)       POINT (66 66)  
Answered By: mcsoini
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.