Sum column data while merging zip_code polygons to MultiPolygons in geopandas
Question:
I'm working with Python in a Jupyter notebook.
I have the following dataset:
+-------+------------+----------+---------------------------------------------------+
| zip | population | area# | polygon |
+-------+------------+----------+---------------------------------------------------+
| 12345 | 50 | 55 | POLYGON ((-55.66788 40.04416, -55.66790 40.044... |
| 12346 | 100 | 55 | POLYGON ((-55.54666 40.40131, -55.54678 40.400... |
| . | . | . | . |
| . | . | . | . |
| 98765 | 236667 | 155 | POLYGON ((-155.04682 78.53585, -155.04680 78.5.. |
+-------+------------+----------+---------------------------------------------------+
Where the polygon column is a geopandas.GeoSeries and each geometry element is a shapely.geometry.polygon.Polygon.
I transformed the dataset into a geodataframe:
from geopandas import GeoDataFrame
dataset = GeoDataFrame(dataset)
And used the set_geometry function to assign the geometry column:
dataset = dataset.set_geometry("polygon")
Everything seems to be working fine and I am able to plot heatmaps using this GeoDataFrame.
The issue is that I am trying to create a dataset that groups the population per area, but I also have to group the polygons, which I have not managed to do.
The final dataset should look like this: all the zip polygons with the same area# should be collapsed into a single row with a MultiPolygon geometry and the total of the population values:
+------------+----------+--------------------------------------------------------+
| population | area# | polygon |
+------------+----------+--------------------------------------------------------+
| 150 | 55 | MULTIPOLYGON ((-55.66788 40.04416, -55.66790 40.044... |
| . | . | . |
| . | . | . |
| . | . | . |
| 236667 | 155 | MULTIPOLYGON ((-155.04682 78.53585, -155.04680 78.5.. |
+------------+----------+--------------------------------------------------------+
I don’t need to follow the steps I outlined above; they are just what I found here on Stack Overflow. I am OK with doing something else from scratch.
Answers:
The geopandas spatial equivalent of a pandas .groupby().aggregate() operation is dissolve. Take a look through the docs, they’re really helpful.
One key argument to note is the aggfunc argument. From the docs:
The aggfunc = argument defaults to ‘first’ which means that the first row of attributes values found in the dissolve routine will be assigned to the resultant dissolved geodataframe. However it also accepts other summary statistic options as allowed by pandas.groupby including:
- ‘first’
- ‘last’
- ‘min’
- ‘max’
- ‘sum’
- ‘mean’
- ‘median’
- function
- string function name
- list of functions and/or function names, e.g. [np.sum, 'mean']
- dict of axis labels -> functions, function names or list of such.
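To make the quoted options concrete, here is a small sketch on made-up data. The rectangles, the numbers, and the variable names gdf, by_first and by_dict are placeholders, not the asker's dataset. It compares the default aggfunc with the dict form, which picks one function per column:

```python
import geopandas as gpd
from shapely.geometry import box

# Made-up stand-in data; box(minx, miny, maxx, maxy) builds a rectangle.
gdf = gpd.GeoDataFrame(
    {
        "zip": ["12345", "12346", "98765"],
        "population": [50, 100, 236667],
        "area#": [55, 55, 155],
        "polygon": [box(0, 0, 1, 1), box(2, 0, 3, 1), box(10, 10, 11, 11)],
    },
    geometry="polygon",
)

# Default aggfunc="first": each group keeps the attributes of its first row.
by_first = gdf.dissolve(by="area#")

# Dict form: one aggregation per column; columns not listed are dropped.
by_dict = gdf.dissolve(by="area#", aggfunc={"population": "sum", "zip": "count"})

print(by_first[["zip", "population"]])
print(by_dict[["zip", "population"]])
```

With the default, area 55 carries the population of whichever zip came first, which is usually not what you want for counts, so the dict form tends to be the safer choice when columns need different treatment.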
If you’re looking to group on area#, and sum the populations within each area, as well as unify the polygons, you can use aggfunc={"population": "sum"}, e.g.:
aggregated = dataset.dissolve("area#", aggfunc={"population": "sum"})
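As a rough end-to-end check, the call above can be tried on a toy GeoDataFrame. The rectangles and population figures below are invented for illustration; only the column names come from the question. Note that dissolve takes the geometric union of each group, so polygons that do not touch come out as a MultiPolygon, while polygons that share an edge are merged into a single Polygon:

```python
import geopandas as gpd
from shapely.geometry import box

# Invented data: area 55 has two disjoint rectangles,
# area 155 has two rectangles that share the edge x = 11.
dataset = gpd.GeoDataFrame(
    {
        "zip": ["12345", "12346", "98765", "98766"],
        "population": [50, 100, 236000, 667],
        "area#": [55, 55, 155, 155],
        "polygon": [
            box(0, 0, 1, 1),
            box(2, 0, 3, 1),
            box(10, 10, 11, 11),
            box(11, 10, 12, 11),
        ],
    },
    geometry="polygon",
)

aggregated = dataset.dissolve("area#", aggfunc={"population": "sum"})

# dissolve puts the grouping column in the index; reset_index() turns it
# back into a regular column, as in the desired output table.
aggregated = aggregated.reset_index()

print(aggregated[["area#", "population"]])
print(aggregated.geometry.geom_type)
```

If every row must be a MultiPolygon regardless of adjacency, the merged Polygon rows would need to be wrapped afterwards; whether that matters depends on what consumes the result.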