Python – How can I aggregate a pandas dataframe based on conditions on different rows?
Question:
I have a pandas data frame with information about road segments.
PRIRTECODE | PRIM_BMP | PRIM_EMP | SEGMENT_LENGTH | ELEMENT_ID | RAMP | CURVE_YEAR | SEGMENT_TYPE |
---|---|---|---|---|---|---|---|
0001A | 0 | 0.147 | 0.147 | 4850943 | 0 | 2019 | Line |
0001A | 0.147 | 0.183 | 0.036 | 4850943 | 0 | 2019 | Line |
0001A | 0.183 | 0.24 | 0.057 | 4850943 | 0 | 2019 | Arc left |
0001A | 0.24 | 0.251 | 0.011 | 4850945 | 0 | 2019 | Arc left |
0001A | 0.251 | 0.27 | 0.019 | 4850945 | 0 | 2019 | Arc left |
0001A | 0.27 | 0.295 | 0.025 | 4048920 | 0 | 2019 | Arc left |
0001A | 0.295 | 0.31 | 0.015 | 4048920 | 0 | 2019 | Line |
0001A | 0.31 | 0.36 | 0.05 | 4048921 | 0 | 2019 | Line |
0001A | 0.36 | 0.363 | 0.003 | 4048779 | 0 | 2019 | Line |
0001A | 0.363 | 0.437 | 0.074 | 4048779 | 0 | 2019 | Arc left |
0001A | 0.437 | 0.483 | 0.046 | 4048779 | 0 | 2019 | Arc right |
0001A | 0.483 | 0.568 | 0.085 | 4048779 | 0 | 2019 | Arc right |
0001A | 0.568 | 0.6 | 0.032 | 4048779 | 0 | 2019 | Line |
I need to aggregate based on shared characteristics such as SEGMENT_TYPE, and sum the SEGMENT_LENGTH. I can do this with pandas groupby. However, I need to make sure that the segments being aggregated are contiguous. To do that, I need to look at the following variables:
- PRIM_BMP: mile in which the segment begins.
- PRIM_EMP: mile in which the segment ends.
So two segments are contiguous if the PRIM_EMP of one segment is equal to the PRIM_BMP of the next segment. Also, I need to keep the PRIM_BMP of the first segment and the PRIM_EMP of the last segment.
The end result should look like this:
PRIRTECODE | PRIM_BMP | PRIM_EMP | SEGMENT_LENGTH | RAMP | CURVE_YEAR | SEGMENT_TYPE |
---|---|---|---|---|---|---|
0001A | 0 | 0.183 | 0.183 | 0 | 2019 | Line |
0001A | 0.183 | 0.295 | 0.112 | 0 | 2019 | Arc left |
0001A | 0.295 | 0.363 | 0.068 | 0 | 2019 | Line |
0001A | 0.363 | 0.568 | 0.205 | 0 | 2019 | Arc right |
0001A | 0.568 | 0.6 | 0.032 | 0 | 2019 | Line |
I have tried groupby on the characteristics by which I need to aggregate the segments, but I have not found a way to aggregate only the contiguous segments.
Answers:
Just in case you run out of options, here's a convtools-based solution (full disclosure: I'm the author).
```python
from convtools import conversion as c

rows = [
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.0, 'PRIM_EMP': 0.147, 'SEGMENT_LENGTH': 0.147, 'ELEMENT_ID': 4850943, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Line'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.147, 'PRIM_EMP': 0.183, 'SEGMENT_LENGTH': 0.036, 'ELEMENT_ID': 4850943, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Line'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.183, 'PRIM_EMP': 0.24, 'SEGMENT_LENGTH': 0.057, 'ELEMENT_ID': 4850943, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Arc left'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.24, 'PRIM_EMP': 0.251, 'SEGMENT_LENGTH': 0.011, 'ELEMENT_ID': 4850945, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Arc left'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.251, 'PRIM_EMP': 0.27, 'SEGMENT_LENGTH': 0.019, 'ELEMENT_ID': 4850945, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Arc left'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.27, 'PRIM_EMP': 0.295, 'SEGMENT_LENGTH': 0.025, 'ELEMENT_ID': 4048920, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Arc left'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.295, 'PRIM_EMP': 0.31, 'SEGMENT_LENGTH': 0.015, 'ELEMENT_ID': 4048920, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Line'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.31, 'PRIM_EMP': 0.36, 'SEGMENT_LENGTH': 0.05, 'ELEMENT_ID': 4048921, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Line'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.36, 'PRIM_EMP': 0.363, 'SEGMENT_LENGTH': 0.003, 'ELEMENT_ID': 4048779, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Line'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.363, 'PRIM_EMP': 0.437, 'SEGMENT_LENGTH': 0.074, 'ELEMENT_ID': 4048779, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Arc left'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.437, 'PRIM_EMP': 0.483, 'SEGMENT_LENGTH': 0.046, 'ELEMENT_ID': 4048779, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Arc right'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.483, 'PRIM_EMP': 0.568, 'SEGMENT_LENGTH': 0.085, 'ELEMENT_ID': 4048779, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Arc right'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.568, 'PRIM_EMP': 0.6, 'SEGMENT_LENGTH': 0.032, 'ELEMENT_ID': 4048779, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Line'},
]

iterable_of_results = (
    c.chunk_by_condition(
        c.and_(
            c.CHUNK.item(-1, "SEGMENT_TYPE") == c.item("SEGMENT_TYPE"),
            c.CHUNK.item(-1, "PRIM_EMP") == c.item("PRIM_BMP"),
        )
    )
    .aggregate(
        {
            "SEGMENT_TYPE": c.ReduceFuncs.First(c.item("SEGMENT_TYPE")),
            "length": c.ReduceFuncs.Sum(c.item("SEGMENT_LENGTH")),
        }
    )
    # should you want to reuse this conversion multiple times, run
    # .gen_converter() to get a function and store it for further reuse
    .execute(rows)
)
```
Result:
```python
In [54]: list(iterable_of_results)
Out[54]:
[{'SEGMENT_TYPE': 'Line', 'length': 0.183},
 {'SEGMENT_TYPE': 'Arc left', 'length': 0.11200000000000002},
 {'SEGMENT_TYPE': 'Line', 'length': 0.068},
 {'SEGMENT_TYPE': 'Arc left', 'length': 0.074},
 {'SEGMENT_TYPE': 'Arc right', 'length': 0.131},
 {'SEGMENT_TYPE': 'Line', 'length': 0.032}]
```
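If you'd rather not add a dependency, the same chunk-by-condition idea is a short loop in plain Python. This is a minimal sketch over the question's rows; `merge_contiguous` is an illustrative name, not part of any library. Unlike the aggregate above, it also keeps the first PRIM_BMP and last PRIM_EMP of each run, which the question asks for:

```python
def merge_contiguous(rows):
    """Group contiguous rows that share a SEGMENT_TYPE and sum their lengths.

    A row joins the current chunk only if its type matches the chunk's and
    its PRIM_BMP equals the previous row's PRIM_EMP; otherwise a new chunk
    starts.
    """
    chunks = []
    for row in rows:
        if (
            chunks
            and chunks[-1][-1]["SEGMENT_TYPE"] == row["SEGMENT_TYPE"]
            and chunks[-1][-1]["PRIM_EMP"] == row["PRIM_BMP"]
        ):
            chunks[-1].append(row)
        else:
            chunks.append([row])
    return [
        {
            "PRIM_BMP": chunk[0]["PRIM_BMP"],    # begin of the first segment
            "PRIM_EMP": chunk[-1]["PRIM_EMP"],   # end of the last segment
            "SEGMENT_LENGTH": sum(r["SEGMENT_LENGTH"] for r in chunk),
            "SEGMENT_TYPE": chunk[0]["SEGMENT_TYPE"],
        }
        for chunk in chunks
    ]
```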
A way to do it using numpy and pandas:
- Sort the data based on the route code and the mileposts.

```python
# sort the data (column names as in the question)
df_roads = df_roads.sort_values(by=['PRIRTECODE', 'PRIM_BMP', 'PRIM_EMP'])
```
- Create new IDs to identify the segments that share the characteristics on which we want to merge; e.g., if rows (road segments) 1 and 2 should be merged, then both get the same id.

```python
# create new IDs: walk the rows and start a new id whenever the route
# changes, the mileposts are not contiguous, or the segment type changes
ids = [1]   # id of the first row; ids are consecutive numbers from 1
new_id = 1
for row in range(df_roads.shape[0] - 1):
    same_route = df_roads['PRIRTECODE'].iloc[row] == df_roads['PRIRTECODE'].iloc[row + 1]
    contiguous = df_roads['PRIM_EMP'].iloc[row] == df_roads['PRIM_BMP'].iloc[row + 1]
    same_type = df_roads['SEGMENT_TYPE'].iloc[row] == df_roads['SEGMENT_TYPE'].iloc[row + 1]
    if not (same_route and contiguous and same_type):
        new_id += 1
    ids.append(new_id)
df_roads['ID'] = ids
```
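The row-by-row loop can also be vectorized: a break starts wherever the route changes, the mileposts do not line up, or the segment type changes, and the cumulative sum of breaks yields the same 1-based ids. A sketch on a truncated sample of the question's data, assuming the frame is already sorted as in step 1:

```python
import pandas as pd

# truncated sample of the question's data, already sorted
df_roads = pd.DataFrame({
    'PRIRTECODE': ['0001A'] * 5,
    'PRIM_BMP': [0.0, 0.147, 0.183, 0.24, 0.295],
    'PRIM_EMP': [0.147, 0.183, 0.24, 0.295, 0.31],
    'SEGMENT_TYPE': ['Line', 'Line', 'Arc left', 'Arc left', 'Line'],
})

# True where a new run starts (the first row always starts one,
# since comparisons against the shifted NaN are True)
breaks = (
    (df_roads['PRIRTECODE'] != df_roads['PRIRTECODE'].shift())
    | (df_roads['PRIM_BMP'] != df_roads['PRIM_EMP'].shift())
    | (df_roads['SEGMENT_TYPE'] != df_roads['SEGMENT_TYPE'].shift())
)
df_roads['ID'] = breaks.cumsum()
```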
- Aggregate the data based on the previously created IDs.

```python
# aggregate the segments
new_ids = df_roads['ID'].unique()
df_roads_agg = pd.DataFrame(index=range(len(new_ids)), columns=df_roads.columns)
for id_iter in range(len(new_ids)):
    segments = df_roads[df_roads['ID'] == new_ids[id_iter]]  # rows sharing the same id
    df_roads_agg.iloc[id_iter, :] = segments.iloc[0, :]
    # the merged segment begins at the smallest begin milepost and ends at the largest end milepost
    df_roads_agg.loc[id_iter, 'PRIM_BMP'] = segments['PRIM_BMP'].min()
    df_roads_agg.loc[id_iter, 'PRIM_EMP'] = segments['PRIM_EMP'].max()
    df_roads_agg.loc[id_iter, 'SEGMENT_TYPE'] = segments['SEGMENT_TYPE'].iloc[0]
    df_roads_agg.loc[id_iter, 'SEGMENT_LENGTH'] = (
        df_roads_agg.loc[id_iter, 'PRIM_EMP'] - df_roads_agg.loc[id_iter, 'PRIM_BMP']
    )
    # only if your data also has a traffic column such as ANNUAL_ADT:
    # length-weighted average over the merged segments
    df_roads_agg.loc[id_iter, 'ANNUAL_ADT'] = np.average(
        a=segments['ANNUAL_ADT'], weights=segments['SEGMENT_LENGTH']
    )
```
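Once the ID column exists, the per-id loop above can be collapsed into a single groupby with named aggregations. A sketch on a truncated sample with a precomputed ID column; the column choices mirror the loop (first route code and type, min/max mileposts, summed length):

```python
import pandas as pd

# truncated sample with the ID column already computed as in step 2
df_roads = pd.DataFrame({
    'PRIRTECODE': ['0001A'] * 5,
    'PRIM_BMP': [0.0, 0.147, 0.183, 0.24, 0.295],
    'PRIM_EMP': [0.147, 0.183, 0.24, 0.295, 0.31],
    'SEGMENT_LENGTH': [0.147, 0.036, 0.057, 0.055, 0.015],
    'SEGMENT_TYPE': ['Line', 'Line', 'Arc left', 'Arc left', 'Line'],
    'ID': [1, 1, 2, 2, 3],
})

# one output row per contiguous run
df_roads_agg = df_roads.groupby('ID', as_index=False).agg(
    PRIRTECODE=('PRIRTECODE', 'first'),
    PRIM_BMP=('PRIM_BMP', 'min'),
    PRIM_EMP=('PRIM_EMP', 'max'),
    SEGMENT_LENGTH=('SEGMENT_LENGTH', 'sum'),
    SEGMENT_TYPE=('SEGMENT_TYPE', 'first'),
)
```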