Python: fast aggregation of many observations to daily sum

Question:

I have observations with a start and end date, in the following format:

import pandas as pd

data = pd.DataFrame({
    'start_date':pd.to_datetime(['2021-01-07','2021-01-04','2021-01-12','2021-01-03']),
    'end_date':pd.to_datetime(['2021-01-16','2021-01-12','2021-01-13','2021-01-15']),
    'value':[7,6,5,4]
    })

data

    start_date  end_date    value
0   2021-01-07  2021-01-16  7
1   2021-01-04  2021-01-12  6
2   2021-01-12  2021-01-13  5
3   2021-01-03  2021-01-15  4

The date ranges between observations overlap. I would like to compute the daily sum aggregated across all observations.

My version with a loop (below) is slow and crashes for ~100k observations. What would be a way to speed things up?

def turn_data_into_date_range(row):
  # expand one observation into a daily Series over its date range,
  # filled with the observation's value
  dates = pd.date_range(start=row.start_date, end=row.end_date)
  return pd.Series(data=row.value, index=dates)

out = []
for index, row in data.iterrows():
  out.append(turn_data_into_date_range(row))

result = pd.concat(out, axis=1).sum(axis=1)

result
2021-01-03     4.0
2021-01-04    10.0
2021-01-05    10.0
2021-01-06    10.0
2021-01-07    17.0
2021-01-08    17.0
2021-01-09    17.0
2021-01-10    17.0
2021-01-11    17.0
2021-01-12    22.0
2021-01-13    16.0
2021-01-14    11.0
2021-01-15    11.0
2021-01-16     7.0
Freq: D, dtype: float64

PS: the answer to this related question doesn’t work in my case, as they have non-overlapping observations and can use a left join: Convert Date Ranges to Time Series in Pandas

Answers:

You can use explode to break each range into individual days:

data['day'] = data.apply(lambda row: pd.date_range(row['start_date'], row['end_date']), axis=1)
result = data[['day', 'value']].explode('day').groupby('day').sum()
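
If you prefer a plain Series like the loop version returns, one way is to select the value column before summing:

result = data[['day', 'value']].explode('day').groupby('day')['value'].sum()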
Answered By: Code Different

What would be a way to speed things up?

You did

for index, row in data.iterrows():
  out.append(turn_data_into_date_range(row))

Practical experience shows that .itertuples() is faster than .iterrows(); see Why Pandas itertuples() Is Faster Than iterrows() and How To Make It Even Faster. I suggest reworking your code to use .itertuples(). I don't have the ability to test it right now, but I suspect your turn_data_into_date_range function would work without any changes, as the named tuples it yields support dot-attribute access.
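
A minimal sketch of that rework (untested, assuming the default named tuples so that dot access still works):

out = []
# itertuples() yields named tuples, so row.start_date, row.end_date and row.value
# still resolve inside turn_data_into_date_range
for row in data.itertuples(index=False):
    out.append(turn_data_into_date_range(row))

result = pd.concat(out, axis=1).sum(axis=1)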

Answered By: Daweo

I feel this problem comes back regularly, as it's not an easy thing to do. Most techniques transform each row into a date range or otherwise iterate over rows. In this case there's a smarter workaround: use cumulative sums, then reindex. The cumulative sum of values by start date gives the total value that has started on or before each day, the cumulative sum by end date gives the total that has already ended, and subtracting one from the other leaves the daily sum of the observations still active.

>>> starts = data.set_index('start_date')['value'].sort_index().cumsum()
>>> starts
start_date
2021-01-03     4
2021-01-04    10
2021-01-07    17
2021-01-12    22
Name: value, dtype: int64
>>> ends = data.set_index('end_date')['value'].sort_index().cumsum()
>>> ends
end_date
2021-01-12     6
2021-01-13    11
2021-01-15    15
2021-01-16    22
Name: value, dtype: int64

In case your dates are not unique, you can group by date and sum first. Then the series definitions are as follows:

>>> starts = data.groupby('start_date')['value'].sum().sort_index().cumsum()
>>> ends = data.groupby('end_date')['value'].sum().sort_index().cumsum()

Note that here we don't need the set_index() anymore: sum() is an aggregation and already indexes the result by the group keys, contrary to .cumsum(), which is a transform operation and keeps the original index.
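
To make the distinction concrete, on the example data you should see something like:

>>> data.groupby('start_date')['value'].sum().index.name
'start_date'
>>> data.groupby('start_date')['value'].cumsum().index
RangeIndex(start=0, stop=4, step=1)

The aggregation is already indexed by the dates, while the transform keeps the original row index.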

Of course, if the ends are inclusive, you might need to add a .shift():

>>> dates = pd.date_range(starts.index.min(), ends.index.max())
>>> ends.reindex(dates).ffill().shift().fillna(0)
2021-01-03     0.0
2021-01-04     0.0
2021-01-05     0.0
2021-01-06     0.0
2021-01-07     0.0
2021-01-08     0.0
2021-01-09     0.0
2021-01-10     0.0
2021-01-11     0.0
2021-01-12     0.0
2021-01-13     6.0
2021-01-14    11.0
2021-01-15    11.0
2021-01-16    15.0
Freq: D, Name: value, dtype: float64

Then just subtract the (possibly shifted) ends from the starts:

>>> starts.reindex(dates).ffill() - ends.reindex(dates).ffill().shift().fillna(0)
2021-01-03     4.0
2021-01-04    10.0
2021-01-05    10.0
2021-01-06    10.0
2021-01-07    17.0
2021-01-08    17.0
2021-01-09    17.0
2021-01-10    17.0
2021-01-11    17.0
2021-01-12    22.0
2021-01-13    16.0
2021-01-14    11.0
2021-01-15    11.0
2021-01-16     7.0
Freq: D, Name: value, dtype: float64
Answered By: Cimbali

One option is to solve this with an inequality join, using conditional_join from pyjanitor:

# pip install pyjanitor
import pandas as pd
import janitor

# Build a Series of all the dates:
dates = data.filter(like='date')
start = dates.min().min()
end = dates.max().max()
dates = pd.date_range(start, end, freq='D', name = 'dates')
dates = pd.Series(dates)

(data
.conditional_join(
    dates, 
    ('start_date', 'dates', '<='), 
    ('end_date', 'dates', '>='),
    # depending on the data size, 
    # numba offers more performance
    use_numba=False, 
    df_columns='value')
.groupby('dates')
.sum()
)

            value
dates
2021-01-03      4
2021-01-04     10
2021-01-05     10
2021-01-06     10
2021-01-07     17
2021-01-08     17
2021-01-09     17
2021-01-10     17
2021-01-11     17
2021-01-12     22
2021-01-13     16
2021-01-14     11
2021-01-15     11
2021-01-16      7
Answered By: sammywemmy