K-nearest points from two dataframes with GeoPandas

Question:

GeoPandas uses shapely under the hood. To get the nearest neighbor I saw the use of nearest_points from shapely. However, this approach does not include k-nearest points.

I needed to compute distances to the nearest points between two GeoDataFrames and insert the distance into the GeoDataFrame containing the “from this point” data.

This is my approach using GeoSeries.distance() without using another package or library. Note that when k == 1 the returned value essentially shows the distance to the nearest point. There is also a GeoPandas-only solution for nearest point by @cd98 which inspired my approach.

This works well for my data, but I wonder whether there is a better or faster approach, or some other benefit to using shapely or sklearn.neighbors?

import pandas as pd
import geopandas as gp

# gdf1: GeoDataFrame with a point geometry column (distance is measured from these points)
# gdf2: GeoDataFrame with a point geometry column (distance is measured to these points)

def knearest(from_points, to_points, k):
    distlist = to_points.distance(from_points)
    distlist.sort_values(ascending=True, inplace=True) # To have the closest ones first
    return distlist[:k].mean()

# Loop over several values of k
for Ks in [1, 2, 3, 4, 5, 10]:
    name = 'dist_to_closest_' + str(Ks) # to set column name
    gdf1[name] = gdf1.geometry.apply(knearest, args=(gdf2, Ks))
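
For comparison, the same quantity (the mean distance to the k nearest points) can be sketched with plain NumPy broadcasting instead of a per-row apply. This is only a rough sketch under assumptions: the function name mean_knearest_dist and the sample coordinates are made up for illustration, distances are planar Euclidean, and the full distance matrix has to fit in memory.

```python
import numpy as np

def mean_knearest_dist(from_xy, to_xy, k):
    """Mean Euclidean distance from each source point to its k nearest targets."""
    from_xy = np.asarray(from_xy, dtype=float)
    to_xy = np.asarray(to_xy, dtype=float)
    # Pairwise distance matrix of shape (n_from, n_to)
    diff = from_xy[:, None, :] - to_xy[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Partition so that the k smallest distances occupy the first k columns
    nearest_k = np.partition(dist, k - 1, axis=1)[:, :k]
    return nearest_k.mean(axis=1)

# Hypothetical coordinates standing in for the gdf1 / gdf2 geometries
from_xy = [(0.0, 0.0), (10.0, 0.0)]
to_xy = [(1.0, 0.0), (3.0, 0.0), (6.0, 0.0)]
result = mean_knearest_dist(from_xy, to_xy, k=2)
```

With real data, the coordinate arrays could presumably be built with np.column_stack([gdf.geometry.x, gdf.geometry.y]). k must not exceed the number of target points.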
Asked By: raummensch

Answers:

Yes, there is. First, credit where it is due: this comes from the University of Helsinki's Automating GIS-processes course, which is the source of the code below. Here is how it works.
First, read the data. The example finds the nearest bus stop for each building.

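import geopandas as gpd

# Read the data (read_gdf_from_zip is a helper that is not defined in this answer;
# presumably it comes from the course material)
stops = gpd.read_file('data/pt_stops_helsinki.gpkg')
buildings = read_gdf_from_zip('data/building_points_helsinki.zip')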

Define the functions. Here you can adjust k_neighbors.

from sklearn.neighbors import BallTree
import numpy as np

def get_nearest(src_points, candidates, k_neighbors=1):
    """Find nearest neighbors for all source points from a set of candidate points"""

    # Create tree from the candidate points
    tree = BallTree(candidates, leaf_size=15, metric='haversine')

    # Find closest points and distances
    distances, indices = tree.query(src_points, k=k_neighbors)

    # Transpose to get distances and indices into arrays
    distances = distances.transpose()
    indices = indices.transpose()

    # Get closest indices and distances (i.e. array at index 0)
    # note: for the second closest points, you would take index 1, etc.
    closest = indices[0]
    closest_dist = distances[0]

    # Return indices and distances
    return (closest, closest_dist)


def nearest_neighbor(left_gdf, right_gdf, return_dist=False):
    """
    For each point in left_gdf, find closest point in right GeoDataFrame and return them.

    NOTICE: Assumes that the input Points are in WGS84 projection (lat/lon).
    """

    left_geom_col = left_gdf.geometry.name
    right_geom_col = right_gdf.geometry.name

    # Ensure that index in right gdf is formed of sequential numbers
    right = right_gdf.copy().reset_index(drop=True)

    # Parse coordinates from points and insert them into a numpy array as RADIANS
    left_radians = np.array(left_gdf[left_geom_col].apply(lambda geom: (geom.x * np.pi / 180, geom.y * np.pi / 180)).to_list())
    right_radians = np.array(right[right_geom_col].apply(lambda geom: (geom.x * np.pi / 180, geom.y * np.pi / 180)).to_list())

    # Find the nearest points
    # -----------------------
    # closest ==> index in right_gdf that corresponds to the closest point
    # dist ==> distance between the nearest neighbors (in radians; converted to meters below)

    closest, dist = get_nearest(src_points=left_radians, candidates=right_radians)

    # Return points from right GeoDataFrame that are closest to points in left GeoDataFrame
    closest_points = right.loc[closest]

    # Ensure that the index corresponds the one in left_gdf
    closest_points = closest_points.reset_index(drop=True)

    # Add distance if requested
    if return_dist:
        # Convert to meters from radians
        earth_radius = 6371000  # meters
        closest_points['distance'] = dist * earth_radius

    return closest_points

Do the nearest neighbours analysis

# Find closest public transport stop for each building and get also the distance based on haversine distance
# Note: haversine distance which is implemented here is a bit slower than using e.g. 'euclidean' metric
# but useful as we get the distance between points in meters
closest_stops = nearest_neighbor(buildings, stops, return_dist=True)

Now join the "from" and "to" GeoDataFrames:

# Rename the geometry of closest stops gdf so that we can easily identify it
closest_stops = closest_stops.rename(columns={'geometry': 'closest_stop_geom'})

# Merge the datasets by index (for this, it is good to use '.join()' -function)
buildings = buildings.join(closest_stops)
Answered By: sutan

The answer above using Automating GIS-processes is really nice, but there is an error when the points are converted to a NumPy array in radians: the latitude and longitude are reversed.

left_radians = np.array(left_gdf[left_geom_col].apply(lambda geom: (geom.y * np.pi / 180, geom.x * np.pi / 180)).to_list())

The haversine metric expects points as (lat, lon), but longitude corresponds to the x-axis of a plane or sphere and latitude to the y-axis, so geom.y has to come first. The same change applies to right_radians.
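
To see how much the coordinate order matters, here is a small sketch of the haversine formula in plain NumPy. The helper haversine_m and the two sample points are made up for illustration; the Earth radius is the same 6371000 m used in the answer above. It is not the BallTree implementation, only the same formula written out.

```python
import numpy as np

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters

def haversine_m(p1, p2):
    """Great-circle distance in meters; p1 and p2 are (lat, lon) in degrees."""
    lat1, lon1, lat2, lon2 = np.radians([p1[0], p1[1], p2[0], p2[1]])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

# Two illustrative points one degree of longitude apart at latitude 60
correct = haversine_m((60.0, 25.0), (60.0, 26.0))  # (lat, lon) order
swapped = haversine_m((25.0, 60.0), (26.0, 60.0))  # (lon, lat) passed by mistake
```

In this example the swapped order roughly doubles the distance (about 111 km instead of about 56 km), because one degree of longitude at latitude 60 is treated as one degree of latitude.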

Answered By: Jean-Louis Lamezec

If your data are in grid (projected) coordinates, then the approach is a bit leaner, but with one key gotcha.

Building on sutan’s answer and streamlining the block from the University of Helsinki:

To get multiple neighbors, you edit the k_neighbors argument, and you must ALSO hard-code variables within the body of the function (see my additions below ‘closest’ and ‘closest_dist’) AND add them to the return statement.

Thus, if you want the 2 closest points, it looks like:

from sklearn.neighbors import BallTree
import numpy as np

def get_nearest(src_points, candidates, k_neighbors=2):
    """
    Find nearest neighbors for all source points from a set of candidate points
    modified from: https://automating-gis-processes.github.io/site/notebooks/L3/nearest-neighbor-faster.html
    """
    

    # Create tree from the candidate points
    tree = BallTree(candidates, leaf_size=15, metric='euclidean')

    # Find closest points and distances
    distances, indices = tree.query(src_points, k=k_neighbors)

    # Transpose to get distances and indices into arrays
    distances = distances.transpose()
    indices = indices.transpose()

    # Get closest indices and distances (i.e. array at index 0)
    # note: for the second closest points, you would take index 1, etc.
    closest = indices[0]
    closest_dist = distances[0]
    closest_second = indices[1] # *manually add per comment above*
    closest_second_dist = distances[1] # *manually add per comment above*

    # Return indices and distances
    return (closest, closest_dist, closest_second, closest_second_dist)

The inputs are lists of (x, y) tuples. Since (per the question title) your data is in a GeoDataFrame, build them like this:

# easier to read
in_pts = [(row.geometry.x, row.geometry.y) for idx, row in gdf1.iterrows()]
qry_pts = [(row.geometry.x, row.geometry.y) for idx, row in gdf2.iterrows()]

# faster (by about 7X)
in_pts = [(x,y) for x,y in zip(gdf1.geometry.x , gdf1.geometry.y)]
qry_pts =  [(x,y) for x,y in zip(gdf2.geometry.x , gdf2.geometry.y)]

I’m not interested in distances, so instead of commenting them out of the function, I run:

idx_nearest, _, idx_2ndnearest, _ = get_nearest(in_pts, qry_pts)

and get two arrays of the same length as in_pts that contain, respectively, the index values of the closest and second-closest points in qry_pts (that is, row positions in the original GeoDataFrame gdf2).
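
The hard-coding could probably be avoided by returning the full arrays from tree.query, which have one column per neighbor. The following is a minimal sketch under that assumption; the function name get_k_nearest and the sample coordinates are made up for illustration and are not part of the course code.

```python
import numpy as np
from sklearn.neighbors import BallTree

def get_k_nearest(src_points, candidates, k_neighbors=2):
    """Return (indices, distances), each of shape (n_src, k_neighbors)."""
    tree = BallTree(np.asarray(candidates), leaf_size=15, metric='euclidean')
    # Column j of each array describes the (j+1)-th nearest candidate
    distances, indices = tree.query(np.asarray(src_points), k=k_neighbors)
    return indices, distances

# Hypothetical projected coordinates
qry_pts = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
in_pts = [(0.2, 0.0), (4.0, 0.0)]
indices, distances = get_k_nearest(in_pts, qry_pts, k_neighbors=2)

# Slice columns instead of adding one variable per neighbor
idx_nearest, idx_2ndnearest = indices[:, 0], indices[:, 1]
```

Changing k_neighbors then only changes the number of columns, so the function body and return statement stay the same.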

Answered By: CreekGeek

Great solution! If you are using the Automating GIS-processes solution, make sure to reset the index of the buildings GeoDataFrame before the join (this is only needed if you are using a subset of left_gdf).

buildings.insert(0, 'Number', range(0,len(buildings)))

buildings.set_index('Number' , inplace = True)
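
The reason is that .join() matches rows by index label, not by position. A small sketch with plain pandas DataFrames standing in for the GeoDataFrames (the column values here are invented for illustration):

```python
import pandas as pd

# Hypothetical subset of buildings whose index labels are no longer 0..n-1
buildings = pd.DataFrame({'name': ['a', 'b', 'c']}, index=[10, 11, 12])
# The nearest-neighbor result is indexed 0..n-1 after its own reset_index
closest_stops = pd.DataFrame({'distance': [5.0, 7.5, 2.0]})

# Labels 10..12 find no partner among 0..2, so every distance is NaN
misaligned = buildings.join(closest_stops)

# After resetting the index, rows are matched one to one
aligned = buildings.reset_index(drop=True).join(closest_stops)
```

reset_index(drop=True) is a shorter way to get a 0..n-1 index, assuming the original index labels do not need to be kept.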
Answered By: Mohammad_hdr

Based on the previous answers, here is an all-in-one solution that takes two geopandas.GeoDataFrames as input and searches for the k nearest neighbors.

import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree


def get_nearest_neighbors(gdf1, gdf2, k_neighbors=2):
    '''
    Find k nearest neighbors for all source points from a set of candidate points
    modified from: https://automating-gis-processes.github.io/site/notebooks/L3/nearest-neighbor-faster.html

    Parameters
    ----------
    gdf1 : geopandas.GeoDataFrame
        Geometries to search from.
    gdf2 : geopandas.GeoDataFrame
        Geometries to be searched.
    k_neighbors : int, optional
        Number of nearest neighbors. The default is 2.

    Returns
    -------
    gdf_final : geopandas.GeoDataFrame
        gdf1 with distance, index and all other columns from gdf2.
    '''

    src_points = [(x, y) for x, y in zip(gdf1.geometry.x, gdf1.geometry.y)]
    candidates = [(x, y) for x, y in zip(gdf2.geometry.x, gdf2.geometry.y)]

    # Create tree from the candidate points
    tree = BallTree(candidates, leaf_size=15, metric='euclidean')

    # Find closest points and distances
    distances, indices = tree.query(src_points, k=k_neighbors)

    # Transpose so that row k holds the (k+1)-th nearest neighbor of every source point
    distances = distances.transpose()
    indices = indices.transpose()

    closest_gdfs = []
    for k in np.arange(k_neighbors):
        gdf_new = gdf2.iloc[indices[k]].reset_index()
        gdf_new['distance'] = distances[k]
        gdf_new = gdf_new.add_suffix(f'_{k+1}')
        closest_gdfs.append(gdf_new)

    # Note: concat aligns on index, so gdf1 is assumed to have a default 0..n-1 index
    closest_gdfs.insert(0, gdf1)
    gdf_final = pd.concat(closest_gdfs, axis=1)

    return gdf_final
Answered By: Manuel