Creating sum of date ranges in Pandas
Question:
I have the following DataFrame, with over 3 million rows:
VALID_FROM VALID_TO VALUE
0 2022-01-01 2022-01-02 5
1 2022-01-01 2022-01-03 2
2 2022-01-02 2022-01-04 7
3 2022-01-03 2022-01-06 3
I want to create one large date_range with a sum of the values for each timestamp.
For the DataFrame above that would come out to:
dates val
0 2022-01-01 7
1 2022-01-02 14
2 2022-01-03 12
3 2022-01-04 10
4 2022-01-05 3
5 2022-01-06 3
However, as the DataFrame has a little over 3 million rows, I don't want to iterate over each row, and I'm not sure how to do this without iterating. Any suggestions?
Currently my code looks like this:
new_df = pd.DataFrame()
for idx, row in dummy_df.iterrows():
    dr = pd.date_range(row["VALID_FROM"], end=row["VALID_TO"], freq="D")
    tmp_df = pd.DataFrame({"dates": dr, "val": row["VALUE"]})
    new_df = pd.concat(objs=[new_df, tmp_df], ignore_index=True)
new_df.groupby("dates", as_index=False, group_keys=False).sum()
The result of the groupby would be my desired output.
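For reference, the sample DataFrame shown above can be reconstructed like this (the name dummy_df matches the loop in the question):

```python
import pandas as pd

# the four sample rows from the question
dummy_df = pd.DataFrame({
    "VALID_FROM": pd.to_datetime(["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-03"]),
    "VALID_TO": pd.to_datetime(["2022-01-02", "2022-01-03", "2022-01-04", "2022-01-06"]),
    "VALUE": [5, 2, 7, 3],
})
```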
Answers:
If performance is important, use Index.repeat with DataFrame.loc to generate the new rows, create a dates column from a per-row counter via GroupBy.cumcount, and finally aggregate with sum:
df['VALID_FROM'] = pd.to_datetime(df['VALID_FROM'])
df['VALID_TO'] = pd.to_datetime(df['VALID_TO'])
df1 = df.loc[df.index.repeat(df['VALID_TO'].sub(df['VALID_FROM']).dt.days + 1)]
df1['dates'] = df1['VALID_FROM'] + pd.to_timedelta(df1.groupby(level=0).cumcount(),unit='d')
df1 = df1.groupby('dates', as_index=False)['VALUE'].sum()
print(df1)
dates VALUE
0 2022-01-01 7
1 2022-01-02 14
2 2022-01-03 12
3 2022-01-04 10
4 2022-01-05 3
5 2022-01-06 3
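To see why this works, the intermediate steps can be run on the sample data: Index.repeat duplicates each row once per day it covers (inclusive on both ends), and cumcount within each repeated index produces the 0, 1, 2, ... offsets added to VALID_FROM. A sketch using the question's sample rows:

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    "VALID_FROM": pd.to_datetime(["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-03"]),
    "VALID_TO": pd.to_datetime(["2022-01-02", "2022-01-03", "2022-01-04", "2022-01-06"]),
    "VALUE": [5, 2, 7, 3],
})

# each row covers (VALID_TO - VALID_FROM).days + 1 calendar days,
# so the four rows expand into 2 + 3 + 3 + 4 = 12 daily rows
n_days = df["VALID_TO"].sub(df["VALID_FROM"]).dt.days + 1
df1 = df.loc[df.index.repeat(n_days)]

# grouping by the repeated index, cumcount yields 0, 1, 2, ...
# within each original row; adding it as a day offset gives the dates
df1["dates"] = df1["VALID_FROM"] + pd.to_timedelta(df1.groupby(level=0).cumcount(), unit="d")

out = df1.groupby("dates", as_index=False)["VALUE"].sum()
```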
One option is to build a list of dates, from the min to the max from the original dataframe, use a non-equi join with conditional_join to get matches, and finally groupby and sum:
# pip install pyjanitor
import pandas as pd
import janitor
# build the date pandas object:
dates = df.filter(like='VALID').to_numpy()
dates = pd.date_range(dates.min(), dates.max(), freq='1D')
dates = pd.Series(dates, name='dates')
# compute the inequality join between valid_from and valid_to,
# followed by the aggregation on a groupby:
(df
 .conditional_join(
     dates,
     ('VALID_FROM', 'dates', '<='),
     ('VALID_TO', 'dates', '>='),
     # if you have numba installed,
     # it can improve performance
     use_numba=False,
     df_columns='VALUE')
 .groupby('dates')
 .VALUE
 .sum()
)
dates
2022-01-01 7
2022-01-02 14
2022-01-03 12
2022-01-04 10
2022-01-05 3
2022-01-06 3
Name: VALUE, dtype: int64
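If adding the pyjanitor dependency is not an option, the same non-equi join can be emulated in plain pandas with a cross merge followed by a boolean filter. This is only a sketch: a cross join materialises every (row, date) pair before filtering, so at 3 million rows it will be far more memory-hungry than conditional_join.

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    "VALID_FROM": pd.to_datetime(["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-03"]),
    "VALID_TO": pd.to_datetime(["2022-01-02", "2022-01-03", "2022-01-04", "2022-01-06"]),
    "VALUE": [5, 2, 7, 3],
})

# one row per calendar day spanning the full range
dates = pd.DataFrame(
    {"dates": pd.date_range(df["VALID_FROM"].min(), df["VALID_TO"].max(), freq="D")}
)

# cross join every row against every date, then keep only the dates
# that fall inside each row's validity window (between is inclusive)
out = (
    df.merge(dates, how="cross")
      .loc[lambda d: d["dates"].between(d["VALID_FROM"], d["VALID_TO"])]
      .groupby("dates", as_index=False)["VALUE"]
      .sum()
)
```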