How to improve performance – Merge two dataframes by closest geodetic distance

Question:

I have two dataframes: radar, which holds data on an equispaced grid with columns for longitude, latitude and height value, and ice, which holds information from satellite observations, including the latitude and longitude of each observation. I want to merge the two so that each ice row gets the ‘height’ column from radar, taken from the radar point closest to it by geodetic distance.

I’m currently doing it like this:

from geopy.distance import geodesic
import pandas as pd
def get_distance(out):
    global radar
    dists = radar['latlon'].apply(lambda x: geodesic(out['latlon'], x).km)
    out['dist to radar']=min(dists)
    out['rate_yr_radar']=radar.loc[dists.idxmin()]['rate_yr_radar']
    return out

ICEvsRadar=ice.apply(get_distance, axis=1)

But it’s very slow: I have around 200 points in my ice dataframe and around 50000 in the radar one. Is slow performance to be expected given the computational cost of calculating each distance, or could I improve how I apply the function?

Edit: I uploaded the example data to https://wetransfer.com/downloads/284036652e682a3e665994d360a3068920221203230651/5842f2

The code takes around 25 minutes to run. ice has lon, lat and latlon fields and is 180 rows long; radar has 50000 rows with lon, lat, latlon and rate_yr_radar fields.
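
For scale, the loop above makes roughly 180 × 50,000 = 9 million separate geodesic calls from Python. As a rough sketch of a cheaper brute-force alternative, all distances for one ice row can be computed in a single vectorized NumPy call using the haversine formula. This assumes a spherical Earth with a mean radius of 6371 km, so results will differ slightly from geopy's ellipsoidal geodesic. The name haversine_km and the sample coordinates are made up for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius, spherical model


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in degrees, NumPy-broadcastable."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # Clip guards against tiny floating-point overshoot before arcsin.
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


# Hypothetical sample points standing in for the radar grid and one ice row.
radar_lat = np.array([70.0, 70.5, 71.0])
radar_lon = np.array([-50.0, -50.0, -50.0])
ice_lat, ice_lon = 70.4, -50.1

# One vectorized call replaces a Python-level loop over radar rows.
dists = haversine_km(ice_lat, ice_lon, radar_lat, radar_lon)
nearest = int(np.argmin(dists))  # position of the closest radar point
```

This is still a brute-force search over every radar point, so it only removes the per-call Python overhead; a spatial index avoids most of the distance computations altogether.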

Edit: Using the help from the comment by Atanas, I ended up solving it like this:

import pandas as pd
import numpy as np
from sklearn.neighbors import BallTree

#building tree
Tree = BallTree(np.deg2rad(radar[['lat', 'lon']].values), metric='haversine')

#querying the nearest neighbour
distance, index = Tree.query(np.deg2rad(ice.loc[:, ["lat","lon"]]))

#getting relevant data from radar to merge with ice
reduced_radar = radar.loc[np.concatenate(index), ["rate_yr_radar"]]
reduced_radar['dist to radar'] = np.concatenate(distance) * 6371  # haversine gives radians; multiply by Earth's mean radius to get km
reduced_radar = reduced_radar.reset_index().rename({"index": "index_from_radar"}, axis=1)

#joining data
ice = ice.join(reduced_radar)

It went from a 30-minute runtime to 60 milliseconds!
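
One caveat with the snippet above: Tree.query returns row positions, while radar.loc[...] and ice.join(...) work on index labels, so the result only lines up when both frames have a default 0..n-1 index. A minimal sketch of a position-based variant follows. The sample frames and the positions and dist_km arrays are made-up stand-ins for the real data and for the flattened query output.

```python
import numpy as np
import pandas as pd

# Hypothetical frames with non-default indexes, e.g. after filtering rows.
radar = pd.DataFrame(
    {
        "lat": [70.0, 70.5, 71.0],
        "lon": [-50.0, -50.0, -50.0],
        "rate_yr_radar": [0.1, 0.2, 0.3],
    },
    index=[100, 101, 102],
)
ice = pd.DataFrame({"lat": [70.9, 70.1], "lon": [-50.0, -50.0]}, index=[7, 9])

# Stand-ins for the flattened output of Tree.query:
# row positions in radar and distances already converted to km.
positions = np.array([2, 0])
dist_km = np.array([11.1, 11.1])

nearest = radar.iloc[positions]  # positional lookup, not label lookup

# Assigning NumPy arrays attaches values row by row, ignoring index labels.
merged = ice.assign(
    index_from_radar=nearest.index.to_numpy(),
    rate_yr_radar=nearest["rate_yr_radar"].to_numpy(),
)
merged["dist to radar"] = dist_km
```

Because the values are attached as plain arrays, merged keeps the original index of ice and no rows are dropped or filled with NaN.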

Asked By: Feva

Answers:

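This code takes less than a second on my machine. It treats lon/lat as plain planar coordinates, so the returned distance is in degrees rather than km, and it will probably pick wrong neighbours where longitude wraps around (the antimeridian, or Greenwich if longitudes run from 0 to 360) and become less accurate toward the poles.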

import pandas as pd
import numpy as np
from scipy.spatial import KDTree

#reading data
radar = pd.read_csv("radar.csv")
ice = pd.read_csv("ice.csv")

#extracting points data
pts = np.array(radar.loc[:, ["lon", "lat"]])

#building tree
Tree = KDTree(pts)

#querying the nearest neighbour
distance, index = Tree.query(ice.loc[:, ["lon", "lat"]])

#getting relevant data from radar
reduced_radar = radar.loc[index, ["rate_yr_radar"]]
reduced_radar = reduced_radar.reset_index().rename({"index": "index_from_radar"}, axis=1)

#joining data
ice = ice.join(reduced_radar)
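
If a KDTree is preferred but the points may sit near a longitude wraparound or at high latitudes, one possible workaround is to index 3D unit-sphere coordinates instead of raw lon/lat. The straight-line (chord) distance between two points on the sphere grows monotonically with the great-circle distance, so the nearest neighbour is the same. The sketch below assumes a spherical Earth with a 6371 km mean radius; to_unit_xyz and the sample coordinates are hypothetical.

```python
import numpy as np
from scipy.spatial import KDTree

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius, spherical model


def to_unit_xyz(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to points on the unit sphere."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    return np.column_stack(
        (np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat))
    )


# Hypothetical points on both sides of the 180 degree meridian.
radar_lat = np.array([80.0, 80.0, 80.0])
radar_lon = np.array([-179.5, 0.0, 170.0])
ice_lat = np.array([80.0])
ice_lon = np.array([179.5])

tree = KDTree(to_unit_xyz(radar_lat, radar_lon))
chord, index = tree.query(to_unit_xyz(ice_lat, ice_lon))

# Convert chord length on the unit sphere to great-circle arc length in km.
dist_km = 2 * EARTH_RADIUS_KM * np.arcsin(np.clip(chord / 2, 0.0, 1.0))
```

Here the ice point at longitude 179.5 is matched to the radar point at -179.5, which a planar lon/lat query would treat as 359 degrees away.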

Alternatively, one could look at geopandas' sjoin_nearest: https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.sjoin_nearest.html

Answered By: Atanas Atanasov