Find overlapping numeric ranges between 2 columns pandas and subtract difference of another column

Question:

This is a toy dataset:

df = pd.DataFrame({'ID': ['A','A','A','A'],
                     'target': ['B','B','B','B'],
                     'length':[208,315,1987,3775],
                     'start':[139403,140668,141726,143705],
                     'end':[139609,140982,143711,147467]})

ID  target  length  start   end
0   A   B   208     139403  139609
1   A   B   315     140668  140982
2   A   B   1987    141726  143711
3   A   B   3775    143705  147467

I need to sum the length column taking overlapping ranges in start and end into consideration.

In row 3, the start value 143705 is between the ranges of row 2’s 141726-143711.

143711 – 143705 = 6 so I need to subtract 6 from the length in row 3: 3775 – 6 = 3769.

Which would change the df to:

ID  target  length  start   end
0   A   B   208     139403  139609
1   A   B   315     140668  140982
2   A   B   1987    141726  143711
3   A   B   3769    143705  147467

Then grouping by ID, target and summing length:

df.groupby(['ID','target'])['length'].sum()
ID  target
A   B         6279
Name: length, dtype: int64

Does anyone know how I can do this in a pythonic way? Thanks so much for any help!

Asked By: skiventist

||

Answers:

A possible solution:

(df.assign(length=
           df['start'].lt(df['end'].shift())
           .mul(df['start']-df['end'].shift(fill_value=0))
           .add(df['length'])))

Output:

  ID target  length   start     end
0  A      B     208  139403  139609
1  A      B     315  140668  140982
2  A      B    1987  141726  143711
3  A      B    3769  143705  147467
Answered By: PaulS
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.