Python – How can I aggregate a pandas dataframe based on conditions on different rows?
Question:
I have a pandas data frame with information about road segments.
PRIRTECODE | PRIM_BMP | PRIM_EMP | SEGMENT_LENGTH | ELEMENT_ID | RAMP | CURVE_YEAR | SEGMENT_TYPE |
---|---|---|---|---|---|---|---|
0001A | 0 | 0.147 | 0.147 | 4850943 | 0 | 2019 | Line |
0001A | 0.147 | 0.183 | 0.036 | 4850943 | 0 | 2019 | Line |
0001A | 0.183 | 0.24 | 0.057 | 4850943 | 0 | 2019 | Arc left |
0001A | 0.24 | 0.251 | 0.011 | 4850945 | 0 | 2019 | Arc left |
0001A | 0.251 | 0.27 | 0.019 | 4850945 | 0 | 2019 | Arc left |
0001A | 0.27 | 0.295 | 0.025 | 4048920 | 0 | 2019 | Arc left |
0001A | 0.295 | 0.31 | 0.015 | 4048920 | 0 | 2019 | Line |
0001A | 0.31 | 0.36 | 0.05 | 4048921 | 0 | 2019 | Line |
0001A | 0.36 | 0.363 | 0.003 | 4048779 | 0 | 2019 | Line |
0001A | 0.363 | 0.437 | 0.074 | 4048779 | 0 | 2019 | Arc left |
0001A | 0.437 | 0.483 | 0.046 | 4048779 | 0 | 2019 | Arc right |
0001A | 0.483 | 0.568 | 0.085 | 4048779 | 0 | 2019 | Arc right |
0001A | 0.568 | 0.6 | 0.032 | 4048779 | 0 | 2019 | Line |
I need to aggregate based on shared characteristics such as SEGMENT_TYPE, and sum the SEGMENT_LENGTH. I can do this with pandas groupby. However, I need to make sure that the segments being aggregated are contiguous. To do that, I need to look at the following variables:
- PRIM_BMP: mile in which the segment begins.
- PRIM_EMP: mile in which the segment ends.
So two segments are contiguous if the PRIM_EMP of one segment is equal to the PRIM_BMP of the next segment. Also, I need to keep the PRIM_BMP of the first segment and the PRIM_EMP of the last segment.
The end result should look like this:
PRIRTECODE | PRIM_BMP | PRIM_EMP | SEGMENT_LENGTH | RAMP | CURVE_YEAR | SEGMENT_TYPE |
---|---|---|---|---|---|---|
0001A | 0 | 0.183 | 0.183 | 0 | 2019 | Line |
0001A | 0.183 | 0.295 | 0.112 | 0 | 2019 | Arc left |
0001A | 0.295 | 0.363 | 0.068 | 0 | 2019 | Line |
0001A | 0.363 | 0.568 | 0.205 | 0 | 2019 | Arc right |
0001A | 0.568 | 0.6 | 0.032 | 0 | 2019 | Line |
I have tried groupby on the characteristics by which I need to aggregate the segments, but I have not found a way to aggregate only the contiguous segments.
Answers:
Just in case you run out of options, here's a convtools-based solution (full disclosure: I'm the author).
```python
from convtools import conversion as c

rows = [
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.0, 'PRIM_EMP': 0.147, 'SEGMENT_LENGTH': 0.147, 'ELEMENT_ID': 4850943, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Line'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.147, 'PRIM_EMP': 0.183, 'SEGMENT_LENGTH': 0.036, 'ELEMENT_ID': 4850943, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Line'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.183, 'PRIM_EMP': 0.24, 'SEGMENT_LENGTH': 0.057, 'ELEMENT_ID': 4850943, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Arc left'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.24, 'PRIM_EMP': 0.251, 'SEGMENT_LENGTH': 0.011, 'ELEMENT_ID': 4850945, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Arc left'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.251, 'PRIM_EMP': 0.27, 'SEGMENT_LENGTH': 0.019, 'ELEMENT_ID': 4850945, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Arc left'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.27, 'PRIM_EMP': 0.295, 'SEGMENT_LENGTH': 0.025, 'ELEMENT_ID': 4048920, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Arc left'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.295, 'PRIM_EMP': 0.31, 'SEGMENT_LENGTH': 0.015, 'ELEMENT_ID': 4048920, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Line'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.31, 'PRIM_EMP': 0.36, 'SEGMENT_LENGTH': 0.05, 'ELEMENT_ID': 4048921, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Line'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.36, 'PRIM_EMP': 0.363, 'SEGMENT_LENGTH': 0.003, 'ELEMENT_ID': 4048779, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Line'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.363, 'PRIM_EMP': 0.437, 'SEGMENT_LENGTH': 0.074, 'ELEMENT_ID': 4048779, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Arc left'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.437, 'PRIM_EMP': 0.483, 'SEGMENT_LENGTH': 0.046, 'ELEMENT_ID': 4048779, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Arc right'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.483, 'PRIM_EMP': 0.568, 'SEGMENT_LENGTH': 0.085, 'ELEMENT_ID': 4048779, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Arc right'},
    {'PRIRTECODE': '0001A', 'PRIM_BMP': 0.568, 'PRIM_EMP': 0.6, 'SEGMENT_LENGTH': 0.032, 'ELEMENT_ID': 4048779, 'RAMP': 0, 'CURVE_YEAR': 2019, 'SEGMENT_TYPE': 'Line'},
]

iterable_of_results = (
    c.chunk_by_condition(
        c.and_(
            c.CHUNK.item(-1, "SEGMENT_TYPE") == c.item("SEGMENT_TYPE"),
            c.CHUNK.item(-1, "PRIM_EMP") == c.item("PRIM_BMP"),
        )
    )
    .aggregate(
        {
            "SEGMENT_TYPE": c.ReduceFuncs.First(c.item("SEGMENT_TYPE")),
            "length": c.ReduceFuncs.Sum(c.item("SEGMENT_LENGTH")),
        }
    )
    # should you want to reuse this conversion multiple times, run
    # .gen_converter() to get a function and store it for further reuse
    .execute(rows)
)
```
Result:
```python
In [54]: list(iterable_of_results)
Out[54]:
[{'SEGMENT_TYPE': 'Line', 'length': 0.183},
 {'SEGMENT_TYPE': 'Arc left', 'length': 0.11200000000000002},
 {'SEGMENT_TYPE': 'Line', 'length': 0.068},
 {'SEGMENT_TYPE': 'Arc left', 'length': 0.074},
 {'SEGMENT_TYPE': 'Arc right', 'length': 0.131},
 {'SEGMENT_TYPE': 'Line', 'length': 0.032}]
```
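If you'd rather not add a dependency, the same chunk-by-condition idea is a short loop in plain Python. This is a minimal sketch over the question's rows; `merge_contiguous` is an illustrative name, not part of any library. Unlike the aggregate above, it also keeps the first PRIM_BMP and last PRIM_EMP of each run, which the question asks for:

```python
def merge_contiguous(rows):
    """Group contiguous rows that share a SEGMENT_TYPE and sum their lengths.

    A row joins the current chunk only if its type matches the chunk's and
    its PRIM_BMP equals the previous row's PRIM_EMP; otherwise a new chunk
    starts.
    """
    chunks = []
    for row in rows:
        if (
            chunks
            and chunks[-1][-1]["SEGMENT_TYPE"] == row["SEGMENT_TYPE"]
            and chunks[-1][-1]["PRIM_EMP"] == row["PRIM_BMP"]
        ):
            chunks[-1].append(row)
        else:
            chunks.append([row])
    return [
        {
            "PRIM_BMP": chunk[0]["PRIM_BMP"],    # begin of the first segment
            "PRIM_EMP": chunk[-1]["PRIM_EMP"],   # end of the last segment
            "SEGMENT_LENGTH": sum(r["SEGMENT_LENGTH"] for r in chunk),
            "SEGMENT_TYPE": chunk[0]["SEGMENT_TYPE"],
        }
        for chunk in chunks
    ]
```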
A way to do it using numpy and pandas:
- Sort the data based on the route code and the mileposts.

```python
# sort the data (column names as in the question)
df_roads = df_roads.sort_values(by=['PRIRTECODE', 'PRIM_BMP', 'PRIM_EMP'])
```
- Create new IDs to identify the segments that share the characteristics on which we want to merge; e.g., if rows (road segments) 1 and 2 should be merged, then both get the same id.

```python
# create new IDs: walk the rows and start a new id whenever the route
# changes, the mileposts are not contiguous, or the segment type changes
ids = [1]   # id of the first row; ids are consecutive numbers from 1
new_id = 1
for row in range(df_roads.shape[0] - 1):
    same_route = df_roads['PRIRTECODE'].iloc[row] == df_roads['PRIRTECODE'].iloc[row + 1]
    contiguous = df_roads['PRIM_EMP'].iloc[row] == df_roads['PRIM_BMP'].iloc[row + 1]
    same_type = df_roads['SEGMENT_TYPE'].iloc[row] == df_roads['SEGMENT_TYPE'].iloc[row + 1]
    if not (same_route and contiguous and same_type):
        new_id += 1
    ids.append(new_id)
df_roads['ID'] = ids
```
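The row-by-row loop can also be vectorized: a break starts wherever the route changes, the mileposts do not line up, or the segment type changes, and the cumulative sum of breaks yields the same 1-based ids. A sketch on a truncated sample of the question's data, assuming the frame is already sorted as in step 1:

```python
import pandas as pd

# truncated sample of the question's data, already sorted
df_roads = pd.DataFrame({
    'PRIRTECODE': ['0001A'] * 5,
    'PRIM_BMP': [0.0, 0.147, 0.183, 0.24, 0.295],
    'PRIM_EMP': [0.147, 0.183, 0.24, 0.295, 0.31],
    'SEGMENT_TYPE': ['Line', 'Line', 'Arc left', 'Arc left', 'Line'],
})

# True where a new run starts (the first row always starts one,
# since comparisons against the shifted NaN are True)
breaks = (
    (df_roads['PRIRTECODE'] != df_roads['PRIRTECODE'].shift())
    | (df_roads['PRIM_BMP'] != df_roads['PRIM_EMP'].shift())
    | (df_roads['SEGMENT_TYPE'] != df_roads['SEGMENT_TYPE'].shift())
)
df_roads['ID'] = breaks.cumsum()
```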
- Aggregate the data based on the previously created IDs.

```python
# aggregate the segments
new_ids = df_roads['ID'].unique()
df_roads_agg = pd.DataFrame(index=range(len(new_ids)), columns=df_roads.columns)
for id_iter in range(len(new_ids)):
    segments = df_roads[df_roads['ID'] == new_ids[id_iter]]  # rows sharing the same id
    df_roads_agg.iloc[id_iter, :] = segments.iloc[0, :]
    # the merged segment begins at the smallest begin milepost and ends at the largest end milepost
    df_roads_agg.loc[id_iter, 'PRIM_BMP'] = segments['PRIM_BMP'].min()
    df_roads_agg.loc[id_iter, 'PRIM_EMP'] = segments['PRIM_EMP'].max()
    df_roads_agg.loc[id_iter, 'SEGMENT_TYPE'] = segments['SEGMENT_TYPE'].iloc[0]
    df_roads_agg.loc[id_iter, 'SEGMENT_LENGTH'] = (
        df_roads_agg.loc[id_iter, 'PRIM_EMP'] - df_roads_agg.loc[id_iter, 'PRIM_BMP']
    )
    # only if your data also has a traffic column such as ANNUAL_ADT:
    # length-weighted average over the merged segments
    df_roads_agg.loc[id_iter, 'ANNUAL_ADT'] = np.average(
        a=segments['ANNUAL_ADT'], weights=segments['SEGMENT_LENGTH']
    )
```
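Once the ID column exists, the per-id loop above can be collapsed into a single groupby with named aggregations. A sketch on a truncated sample with a precomputed ID column; the column choices mirror the loop (first route code and type, min/max mileposts, summed length):

```python
import pandas as pd

# truncated sample with the ID column already computed as in step 2
df_roads = pd.DataFrame({
    'PRIRTECODE': ['0001A'] * 5,
    'PRIM_BMP': [0.0, 0.147, 0.183, 0.24, 0.295],
    'PRIM_EMP': [0.147, 0.183, 0.24, 0.295, 0.31],
    'SEGMENT_LENGTH': [0.147, 0.036, 0.057, 0.055, 0.015],
    'SEGMENT_TYPE': ['Line', 'Line', 'Arc left', 'Arc left', 'Line'],
    'ID': [1, 1, 2, 2, 3],
})

# one output row per contiguous run
df_roads_agg = df_roads.groupby('ID', as_index=False).agg(
    PRIRTECODE=('PRIRTECODE', 'first'),
    PRIM_BMP=('PRIM_BMP', 'min'),
    PRIM_EMP=('PRIM_EMP', 'max'),
    SEGMENT_LENGTH=('SEGMENT_LENGTH', 'sum'),
    SEGMENT_TYPE=('SEGMENT_TYPE', 'first'),
)
```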