Python – How can I aggregate a pandas dataframe based on conditions on different rows?

Question:

I have a pandas data frame with information about road segments.

PRIRTECODE PRIM_BMP PRIM_EMP SEGMENT_LENGTH ELEMENT_ID RAMP CURVE_YEAR SEGMENT_TYPE
0001A 0 0.147 0.147 4850943 0 2019 Line
0001A 0.147 0.183 0.036 4850943 0 2019 Line
0001A 0.183 0.24 0.057 4850943 0 2019 Arc left
0001A 0.24 0.251 0.011 4850945 0 2019 Arc left
0001A 0.251 0.27 0.019 4850945 0 2019 Arc left
0001A 0.27 0.295 0.025 4048920 0 2019 Arc left
0001A 0.295 0.31 0.015 4048920 0 2019 Line
0001A 0.31 0.36 0.05 4048921 0 2019 Line
0001A 0.36 0.363 0.003 4048779 0 2019 Line
0001A 0.363 0.437 0.074 4048779 0 2019 Arc left
0001A 0.437 0.483 0.046 4048779 0 2019 Arc right
0001A 0.483 0.568 0.085 4048779 0 2019 Arc right
0001A 0.568 0.6 0.032 4048779 0 2019 Line

I need to aggregate rows that share characteristics such as SEGMENT_TYPE, and sum their SEGMENT_LENGTH. I can do this with pandas groupby. However, I need to make sure that the segments being aggregated are contiguous. To do that, I need to look at the following variables:

  • PRIM_BMP: milepost at which the segment begins.
  • PRIM_EMP: milepost at which the segment ends.

So two segments are contiguous if the PRIM_EMP of the first segment is equal to the PRIM_BMP of the second segment. Also, each aggregated segment needs to keep the PRIM_BMP of its first segment and the PRIM_EMP of its last segment.

The end result should look like this:

PRIRTECODE PRIM_BMP PRIM_EMP SEGMENT_LENGTH RAMP CURVE_YEAR SEGMENT_TYPE
0001A 0 0.183 0.183 0 2019 Line
0001A 0.183 0.295 0.112 0 2019 Arc left
0001A 0.295 0.363 0.068 0 2019 Line
0001A 0.363 0.568 0.205 0 2019 Arc right
0001A 0.568 0.6 0.032 0 2019 Line

I have tried groupby on the characteristic by which I need to aggregate the segments, but I have not found a way to aggregate only the contiguous segments.

Asked By: Jhan


Answers:

Just in case you run out of options, here’s a convtools-based solution. (I must confess: I’m the author.)

from convtools import conversion as c


rows = [
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.0, 'PRIM_EMP': 0.147, 'SEGMENT_LENGTH': 0.147, 'ELEMENT_ID': 4850943, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Line'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.147, 'PRIM_EMP': 0.183, 'SEGMENT_LENGTH': 0.036, 'ELEMENT_ID': 4850943, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Line'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.183, 'PRIM_EMP': 0.24, 'SEGMENT_LENGTH': 0.057, 'ELEMENT_ID': 4850943, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Arc left'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.24, 'PRIM_EMP': 0.251, 'SEGMENT_LENGTH': 0.011, 'ELEMENT_ID': 4850945, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Arc left'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.251, 'PRIM_EMP': 0.27, 'SEGMENT_LENGTH': 0.019, 'ELEMENT_ID': 4850945, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Arc left'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.27, 'PRIM_EMP': 0.295, 'SEGMENT_LENGTH': 0.025, 'ELEMENT_ID': 4048920, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Arc left'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.295, 'PRIM_EMP': 0.31, 'SEGMENT_LENGTH': 0.015, 'ELEMENT_ID': 4048920, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Line'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.31, 'PRIM_EMP': 0.36, 'SEGMENT_LENGTH': 0.05, 'ELEMENT_ID': 4048921, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Line'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.36, 'PRIM_EMP': 0.363, 'SEGMENT_LENGTH': 0.003, 'ELEMENT_ID': 4048779, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Line'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.363, 'PRIM_EMP': 0.437, 'SEGMENT_LENGTH': 0.074, 'ELEMENT_ID': 4048779, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Arc left'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.437, 'PRIM_EMP': 0.483, 'SEGMENT_LENGTH': 0.046, 'ELEMENT_ID': 4048779, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Arc right'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.483, 'PRIM_EMP': 0.568, 'SEGMENT_LENGTH': 0.085, 'ELEMENT_ID': 4048779, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Arc right'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.568, 'PRIM_EMP': 0.6, 'SEGMENT_LENGTH': 0.032, 'ELEMENT_ID': 4048779, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Line'},
]

iterable_of_results = (
    c.chunk_by_condition(
        c.and_(
            c.CHUNK.item(-1, "SEGMENT_TYPE") == c.item("SEGMENT_TYPE"),
            c.CHUNK.item(-1, "PRIM_EMP") == c.item("PRIM_BMP"),
        )
    )
    .aggregate(
        {
            "SEGMENT_TYPE": c.ReduceFuncs.First(c.item("SEGMENT_TYPE")),
            "length": c.ReduceFuncs.Sum(c.item("SEGMENT_LENGTH")),
        }
    )
    # should you want to reuse this conversion multiple times, run
    # .gen_converter() to get a function and store it for further reuse
    .execute(rows)
)

Result:

In [54]: list(iterable_of_results)
Out[54]:
[{'SEGMENT_TYPE': 'Line', 'length': 0.183},
 {'SEGMENT_TYPE': 'Arc left', 'length': 0.11200000000000002},
 {'SEGMENT_TYPE': 'Line', 'length': 0.068},
 {'SEGMENT_TYPE': 'Arc left', 'length': 0.074},
 {'SEGMENT_TYPE': 'Arc right', 'length': 0.131},
 {'SEGMENT_TYPE': 'Line', 'length': 0.032}]
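
For comparison, the same chunking rule can be sketched in plain Python. This is only an illustration, not part of convtools: `merge_contiguous` is a hypothetical helper name, and the sample uses the first three rows from the question. Unlike the output above, it also keeps the PRIM_BMP of the first row and the PRIM_EMP of the last row in each chunk, which the question asks for.

```python
# Plain-Python sketch of the chunking rule (no third-party packages).
# merge_contiguous is a hypothetical helper name, not a library function.

def merge_contiguous(rows):
    merged = []
    for row in rows:
        last = merged[-1] if merged else None
        if (
            last is not None
            and last["SEGMENT_TYPE"] == row["SEGMENT_TYPE"]
            and last["PRIM_EMP"] == row["PRIM_BMP"]
        ):
            # Same type and touching mileposts: extend the current chunk.
            last["PRIM_EMP"] = row["PRIM_EMP"]
            last["SEGMENT_LENGTH"] += row["SEGMENT_LENGTH"]
        else:
            # Otherwise open a new chunk from the fields of this row.
            merged.append({
                "PRIM_BMP": row["PRIM_BMP"],
                "PRIM_EMP": row["PRIM_EMP"],
                "SEGMENT_LENGTH": row["SEGMENT_LENGTH"],
                "SEGMENT_TYPE": row["SEGMENT_TYPE"],
            })
    return merged


# First three rows of the question's data (other columns omitted).
sample = [
    {"PRIM_BMP": 0.0, "PRIM_EMP": 0.147, "SEGMENT_LENGTH": 0.147, "SEGMENT_TYPE": "Line"},
    {"PRIM_BMP": 0.147, "PRIM_EMP": 0.183, "SEGMENT_LENGTH": 0.036, "SEGMENT_TYPE": "Line"},
    {"PRIM_BMP": 0.183, "PRIM_EMP": 0.24, "SEGMENT_LENGTH": 0.057, "SEGMENT_TYPE": "Arc left"},
]
result = merge_contiguous(sample)
```

The two "Line" rows collapse into one chunk running from 0.0 to 0.183, and the "Arc left" row opens a second chunk.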
Answered By: westandskif

A way to do it using numpy and pandas.

  1. Sort the data by route code and mileposts.
import numpy as np
import pandas as pd
# sort the data
df_roads = df_roads.sort_values(by=['PRIRTECODE', 'PRIM_BMP', 'PRIM_EMP'])
  2. Create new IDs to identify the segments that share the characteristics on which we want to merge; e.g., if rows (road segments) 1 and 2 should be merged, then both should have the same ID.
# create new IDs

new_id = 1  # consecutive group number, starting from 1
ids = [new_id]  # the first row always opens the first group

for row in range(df_roads.shape[0] - 1):  # compare each row with the next one
    same_route = df_roads['PRIRTECODE'].iloc[row] == df_roads['PRIRTECODE'].iloc[row + 1]
    contiguous = df_roads['PRIM_EMP'].iloc[row] == df_roads['PRIM_BMP'].iloc[row + 1]
    same_type = df_roads['SEGMENT_TYPE'].iloc[row] == df_roads['SEGMENT_TYPE'].iloc[row + 1]
    if not (same_route and contiguous and same_type):
        new_id += 1  # any mismatch starts a new group
    ids.append(new_id)

# assign all IDs at once (avoids chained assignment such as df['ID'].iloc[i] = ...)
df_roads['ID'] = ids
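
As an alternative to the row-by-row loop, the ID assignment can be sketched with `shift` and `cumsum`. This is a minimal sketch under assumptions: the frame `df` below is a hand-built copy of the first five rows from the question, already sorted, and it uses exact float equality just like the loop. For mileposts that come out of arithmetic, a tolerance-based comparison such as `np.isclose` may be safer.

```python
import pandas as pd

# First five rows of the question's data, already sorted (illustrative subset).
df = pd.DataFrame({
    "PRIRTECODE": ["0001A"] * 5,
    "PRIM_BMP": [0.0, 0.147, 0.183, 0.24, 0.251],
    "PRIM_EMP": [0.147, 0.183, 0.24, 0.251, 0.27],
    "SEGMENT_TYPE": ["Line", "Line", "Arc left", "Arc left", "Arc left"],
})

# Previous row's values, aligned with the current row.
prev = df.shift(1)

# A row starts a new group if the route or type changes, or if it does not
# begin where the previous row ended. The first row compares against missing
# values, so it always starts a group.
starts_new_group = (
    (df["PRIRTECODE"] != prev["PRIRTECODE"])
    | (df["SEGMENT_TYPE"] != prev["SEGMENT_TYPE"])
    | (df["PRIM_BMP"] != prev["PRIM_EMP"])
)

# The running count of group starts gives consecutive IDs beginning at 1.
df["ID"] = starts_new_group.cumsum()
```

The two "Line" rows get ID 1 and the three "Arc left" rows get ID 2.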
  3. Aggregate the data based on the previously created IDs.
# aggregate the segments

new_ids = df_roads['ID'].unique()
rows_agg = []  # one aggregated row per ID

for new_id in new_ids:
    segments = df_roads[df_roads['ID'] == new_id]  # subset the rows with the same id
    row_agg = segments.iloc[0].copy()  # start from the first segment (keeps SEGMENT_TYPE and other shared columns)
    row_agg['PRIM_BMP'] = segments['PRIM_BMP'].min()  # the begin is the minimum begin milepost of the segments
    row_agg['PRIM_EMP'] = segments['PRIM_EMP'].max()  # the end is the maximum end milepost of the segments
    row_agg['SEGMENT_LENGTH'] = row_agg['PRIM_EMP'] - row_agg['PRIM_BMP']
    # ANNUAL_ADT is not in the sample shown in the question; length-weighted average
    row_agg['ANNUAL_ADT'] = np.average(a=segments['ANNUAL_ADT'], weights=segments['SEGMENT_LENGTH'])
    rows_agg.append(row_agg)

df_roads_agg = pd.DataFrame(rows_agg).reset_index(drop=True)
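
Once the ID column exists, the aggregation step can also be sketched with `groupby` and named aggregation instead of a loop. The frame below is an illustrative subset (the first five rows of the question, with IDs filled in by hand), and it sums SEGMENT_LENGTH as the question describes rather than subtracting mileposts. The length-weighted ANNUAL_ADT average is left out because that column is not in the sample data.

```python
import pandas as pd

# First five rows of the question's data with hand-assigned group IDs.
df = pd.DataFrame({
    "PRIRTECODE": ["0001A"] * 5,
    "PRIM_BMP": [0.0, 0.147, 0.183, 0.24, 0.251],
    "PRIM_EMP": [0.147, 0.183, 0.24, 0.251, 0.27],
    "SEGMENT_LENGTH": [0.147, 0.036, 0.057, 0.011, 0.019],
    "SEGMENT_TYPE": ["Line", "Line", "Arc left", "Arc left", "Arc left"],
    "ID": [1, 1, 2, 2, 2],
})

# One output row per ID: first begin milepost, last end milepost,
# summed length, and the shared route code and segment type.
agg = df.groupby("ID", as_index=False).agg(
    PRIRTECODE=("PRIRTECODE", "first"),
    PRIM_BMP=("PRIM_BMP", "min"),
    PRIM_EMP=("PRIM_EMP", "max"),
    SEGMENT_LENGTH=("SEGMENT_LENGTH", "sum"),
    SEGMENT_TYPE=("SEGMENT_TYPE", "first"),
)
```

Summed floats can carry small rounding noise (as in the 0.11200000000000002 shown in the other answer), so rounding SEGMENT_LENGTH to three decimals before display may be worthwhile.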
Answered By: Jhan