Efficiently merge GeoDataFrames if Polygon from one contains Point from second
Question:
I have two GeoDataFrames,
gdf_point:
Unnamed: 0 latitude longitude geometry
0 0 50.410203 7.236583 POINT (7.23658 50.41020)
1 1 51.303545 7.263082 POINT (7.26308 51.30354)
2 2 50.114965 8.672785 POINT (8.67278 50.11496)
and gdf_poly:
Unnamed: 0 Id geometry
0 0 301286 POLYGON ((9.67079 49.86762, 9.67079 49.86987, ...
1 1 302258 POLYGON ((9.67137 54.75650, 9.67137 54.75874, ...
2 2 302548 POLYGON ((9.66808 48.21535, 9.66808 48.21760, ...
I want to check whether each point in gdf_point is contained by any of the polygons in gdf_poly. If it is, I want the Id of that polygon added to the corresponding row of gdf_point.
Here is my current code:
COUNTER = 0
def f(x, gdf_poly, df_new_point):
    global COUNTER
    for row in gdf_poly.itertuples():
        geom = getattr(row, 'geometry')
        id = getattr(row, 'Id')
        if geom.contains(x):
            print('True')
            df_new_point.loc[COUNTER, 'Id'] = id
    COUNTER = COUNTER + 1
df_new_point = gdf_point
gdf_point['geometry'].apply(lambda x: f(x, gdf_poly, df_new_point))
This works and does what I want. The problem is that it is far too slow: it takes about 50 minutes to process 10k rows (multithreading is a future option), and I want it to handle several million rows. There must be a better and faster way to do this. Thanks for your help.
Answers:
To merge two dataframes on their geometries (not on column or index values), use one of geopandas’s spatial joins. They have a whole section of the docs about it – it’s great – give it a read!
There are two workhorse spatial join functions in geopandas:
- GeoDataFrame.sjoin joins two dataframes based on a binary predicate performed on all combinations of geometries, one of intersects, contains, within, touches, crosses, or overlaps. You can specify whether you want a left, right, or inner join based on the how keyword argument.
- GeoDataFrame.sjoin_nearest joins two dataframes based on which geometry in one dataframe is closest to each element in the other. Similarly, the how argument gives left, right, and inner options. Additionally, there are two arguments to sjoin_nearest not available on sjoin:
  - max_distance: The max_distance argument specifies a maximum search radius for matching geometries. This can have a considerable performance impact in some cases. If you can, it is highly recommended that you use this parameter.
  - distance_col: If set, the resultant GeoDataFrame will include a column with this name containing the computed distances between an input geometry and the nearest geometry.
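As a rough illustration of the two sjoin_nearest arguments described above, here is a minimal sketch on made-up data. The frames shops and stations, their column names, and the choice of EPSG:32632 (a projected CRS, so distances come out in metres) are assumptions for the example and are not from the question.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical toy data in a projected CRS (units: metres).
shops = gpd.GeoDataFrame(
    {"shop": ["s1", "s2"]},
    geometry=[Point(0, 0), Point(100, 0)],
    crs="EPSG:32632",
)
stations = gpd.GeoDataFrame(
    {"station": ["A", "B"]},
    geometry=[Point(0, 30), Point(100, 500)],
    crs="EPSG:32632",
)

# max_distance limits the search radius to 50 m; distance_col stores the
# computed distance. With how="left", shops with no station within 50 m
# are kept and get NaN in the joined columns.
nearest = shops.sjoin_nearest(
    stations, how="left", max_distance=50, distance_col="dist_m"
)
print(nearest[["shop", "station", "dist_m"]])
```

With geographic coordinates (latitude/longitude in degrees), as in the question's data, the distances would be in degrees, so reprojecting with to_crs first is usually the safer choice.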
You can optionally use these global geopandas.sjoin and geopandas.sjoin_nearest functions, or use the methods geopandas.GeoDataFrame.sjoin and geopandas.GeoDataFrame.sjoin_nearest. Note, however, that the docs include a warning that the root-level functions may be deprecated at some point in the future, and recommend the use of the GeoDataFrame methods.
So in your case:
merged = gdf_poly.sjoin(gdf_point, predicate="contains")
will do the trick, though if you want to match polygons where the point falls exactly on the boundary, you may want to consider predicate="intersects".
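Since the question asks for the polygon Id to end up on the rows of gdf_point, one possible variant is to put the points on the left and use predicate="within" with how="left". The sketch below uses small made-up frames in place of the real gdf_point and gdf_poly; the name column and the box polygons are assumptions for illustration only.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical stand-ins for the question's gdf_point and gdf_poly.
gdf_point = gpd.GeoDataFrame(
    {"name": ["a", "b", "c"]},
    geometry=[Point(0.5, 0.5), Point(2.5, 0.5), Point(9.0, 9.0)],
    crs="EPSG:4326",
)
gdf_poly = gpd.GeoDataFrame(
    {"Id": [301286, 302258]},
    geometry=[box(0, 0, 1, 1), box(2, 0, 3, 1)],
    crs="EPSG:4326",
)

# Points on the left: the result has one row per point and matching polygon.
# how="left" keeps points that fall inside no polygon; their Id is NaN.
points_with_id = gdf_point.sjoin(
    gdf_poly[["Id", "geometry"]], how="left", predicate="within"
)
print(points_with_id[["name", "Id"]])
```

If polygons overlap, a point inside several of them appears once per matching polygon, so the result can have more rows than gdf_point.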