Python: fast aggregation of many observations to daily sum
Question:
I have observations with start and end date of the following format:
import pandas as pd
data = pd.DataFrame({
'start_date':pd.to_datetime(['2021-01-07','2021-01-04','2021-01-12','2021-01-03']),
'end_date':pd.to_datetime(['2021-01-16','2021-01-12','2021-01-13','2021-01-15']),
'value':[7,6,5,4]
})
data
start_date end_date value
0 2021-01-07 2021-01-16 7
1 2021-01-04 2021-01-12 6
2 2021-01-12 2021-01-13 5
3 2021-01-03 2021-01-15 4
The date ranges between observations overlap. I would like to compute the daily sum aggregated across all observations.
My version with a loop (below) is slow and crashes for ~100k observations. What would be a way to speed things up?
def turn_data_into_date_range(row):
    dates = pd.date_range(start=row.start_date, end=row.end_date)
    return pd.Series(data=row.value, index=dates)

out = []
for index, row in data.iterrows():
    out.append(turn_data_into_date_range(row))

result = pd.concat(out, axis=1).sum(axis=1)
result
2021-01-03 4.0
2021-01-04 10.0
2021-01-05 10.0
2021-01-06 10.0
2021-01-07 17.0
2021-01-08 17.0
2021-01-09 17.0
2021-01-10 17.0
2021-01-11 17.0
2021-01-12 22.0
2021-01-13 16.0
2021-01-14 11.0
2021-01-15 11.0
2021-01-16 7.0
Freq: D, dtype: float64
PS: the answer to this related question doesn’t work in my case, as they have non-overlapping observations and can use a left join: Convert Date Ranges to Time Series in Pandas
Answers:
You can use explode
to break each range into individual days:
data['day'] = data.apply(lambda row: pd.date_range(row['start_date'], row['end_date']), axis=1)
result = data[['day', 'value']].explode('day').groupby('day').sum()
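Put together on the sample data, this reproduces the loop's output. Note that explode materializes one row per covered day per observation, so memory scales with the total number of days across all ranges.

```python
import pandas as pd

data = pd.DataFrame({
    'start_date': pd.to_datetime(['2021-01-07', '2021-01-04', '2021-01-12', '2021-01-03']),
    'end_date': pd.to_datetime(['2021-01-16', '2021-01-12', '2021-01-13', '2021-01-15']),
    'value': [7, 6, 5, 4],
})

# One DatetimeIndex per row, then one row per covered day via explode
data['day'] = data.apply(
    lambda row: pd.date_range(row['start_date'], row['end_date']), axis=1)
result = data[['day', 'value']].explode('day').groupby('day')['value'].sum()
```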
What would be a way to speed things up?
You did
for index, row in data.iterrows():
    out.append(turn_data_into_date_range(row))
Practical experience shows you can get a speed increase by using .itertuples()
rather than .iterrows();
see Why Pandas itertuples() Is Faster Than iterrows() and How To Make It Even Faster. I suggest reworking your code to use the .itertuples()
method. I cannot test it right now, but I suspect your turn_data_into_date_range
function will work without any changes, as the namedtuples it yields support dot-attribute access.
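Assuming the question's data and helper function, the reworked loop could look like this; itertuples(index=False) yields namedtuples, so the dot access inside the helper works unchanged:

```python
import pandas as pd

data = pd.DataFrame({
    'start_date': pd.to_datetime(['2021-01-07', '2021-01-04', '2021-01-12', '2021-01-03']),
    'end_date': pd.to_datetime(['2021-01-16', '2021-01-12', '2021-01-13', '2021-01-15']),
    'value': [7, 6, 5, 4],
})

def turn_data_into_date_range(row):
    dates = pd.date_range(start=row.start_date, end=row.end_date)
    return pd.Series(data=row.value, index=dates)

# itertuples() yields lightweight namedtuples instead of building a Series per row
out = [turn_data_into_date_range(row) for row in data.itertuples(index=False)]
result = pd.concat(out, axis=1).sum(axis=1)
```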
I feel this problem comes back regularly as it’s not an easy thing to do. Some techniques would probably transform each row into a date range or otherwise iterate on rows. In this case there’s a smarter workaround, which is to use cumulative sums, then reindex.
>>> starts = data.set_index('start_date')['value'].sort_index().cumsum()
>>> starts
start_date
2021-01-03 4
2021-01-04 10
2021-01-07 17
2021-01-12 22
Name: value, dtype: int64
>>> ends = data.set_index('end_date')['value'].sort_index().cumsum()
>>> ends
end_date
2021-01-12 6
2021-01-13 11
2021-01-15 15
2021-01-16 22
Name: value, dtype: int64
In case your dates are not unique, you can group by date and sum first. Then the series definitions are as follows:
>>> starts = data.groupby('start_date')['value'].sum().sort_index().cumsum()
>>> ends = data.groupby('end_date')['value'].sum().sort_index().cumsum()
Note that set_index() is no longer needed here: sum() is an aggregation, so it already moves the grouping dates into the index, whereas .cumsum() is a transform and keeps the original index.
Since the end dates here are inclusive, the ends series needs a .shift() so that each observation still counts on its last day:
>>> dates = pd.date_range(starts.index.min(), ends.index.max())
>>> ends.reindex(dates).ffill().shift().fillna(0)
2021-01-03 0.0
2021-01-04 0.0
2021-01-05 0.0
2021-01-06 0.0
2021-01-07 0.0
2021-01-08 0.0
2021-01-09 0.0
2021-01-10 0.0
2021-01-11 0.0
2021-01-12 0.0
2021-01-13 6.0
2021-01-14 11.0
2021-01-15 11.0
2021-01-16 15.0
Freq: D, Name: value, dtype: float64
Then just subtract the (possibly shifted) ends from the starts:
>>> starts.reindex(dates).ffill() - ends.reindex(dates).ffill().shift().fillna(0)
2021-01-03 4.0
2021-01-04 10.0
2021-01-05 10.0
2021-01-06 10.0
2021-01-07 17.0
2021-01-08 17.0
2021-01-09 17.0
2021-01-10 17.0
2021-01-11 17.0
2021-01-12 22.0
2021-01-13 16.0
2021-01-14 11.0
2021-01-15 11.0
2021-01-16 7.0
Freq: D, Name: value, dtype: float64
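The steps above can be collected into one function (end dates inclusive, as in the question):

```python
import pandas as pd

def daily_sum(data):
    # Total value that has started on or before each date
    starts = data.groupby('start_date')['value'].sum().sort_index().cumsum()
    # Total value that has ended before each date (shifted because ends are inclusive)
    ends = data.groupby('end_date')['value'].sum().sort_index().cumsum()
    dates = pd.date_range(starts.index.min(), ends.index.max())
    return (starts.reindex(dates).ffill()
            - ends.reindex(dates).ffill().shift().fillna(0))

data = pd.DataFrame({
    'start_date': pd.to_datetime(['2021-01-07', '2021-01-04', '2021-01-12', '2021-01-03']),
    'end_date': pd.to_datetime(['2021-01-16', '2021-01-12', '2021-01-13', '2021-01-15']),
    'value': [7, 6, 5, 4],
})
result = daily_sum(data)
```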
One option is to solve this with an inequality join, using the conditional_join from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
# Build a Series of all the dates:
dates = data.filter(like='date')
start = dates.min().min()
end = dates.max().max()
dates = pd.date_range(start, end, freq='D', name='dates')
dates = pd.Series(dates)
(data
 .conditional_join(
     dates,
     ('start_date', 'dates', '<='),
     ('end_date', 'dates', '>='),
     # depending on the data size,
     # numba offers more performance
     use_numba=False,
     df_columns='value')
 .groupby('dates')
 .sum()
)
value
dates
2021-01-03 4
2021-01-04 10
2021-01-05 10
2021-01-06 10
2021-01-07 17
2021-01-08 17
2021-01-09 17
2021-01-10 17
2021-01-11 17
2021-01-12 22
2021-01-13 16
2021-01-14 11
2021-01-15 11
2021-01-16 7
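If pulling in pyjanitor is not an option, a plain-pandas sketch of the same inequality join uses a cross merge followed by a filter. This builds the full len(data) × len(dates) product in memory, so it only suits moderate sizes:

```python
import pandas as pd

data = pd.DataFrame({
    'start_date': pd.to_datetime(['2021-01-07', '2021-01-04', '2021-01-12', '2021-01-03']),
    'end_date': pd.to_datetime(['2021-01-16', '2021-01-12', '2021-01-13', '2021-01-15']),
    'value': [7, 6, 5, 4],
})

dates = pd.DataFrame(
    {'dates': pd.date_range(data['start_date'].min(), data['end_date'].max())})

# Cartesian product, then keep the pairs where the date falls inside the range
cross = data.merge(dates, how='cross')
inside = cross['start_date'].le(cross['dates']) & cross['end_date'].ge(cross['dates'])
result = cross.loc[inside].groupby('dates')['value'].sum()
```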