How to split overlapping date ranges into multiple date ranges in pandas?
Question:
I have the following pandas DataFrame.
I want to find the overlapping date ranges and aggregate the companies that share each date range. My desired output would be the following:
How can I achieve this in pandas? I tried a few approaches but could not get any of them to cover all scenarios. I would really appreciate any help with this. Thanks.
EDIT: Sample Data
df = pd.DataFrame(data={'Company':['A','B','C','D','E','F','G'],'StartDate':['2023-04-01','2023-04-01','2023-04-01','2023-04-01','2023-04-08','2023-04-15','2023-04-20'],'EndDate':['2023-04-14','2023-05-09','2023-05-09','2023-05-09','2023-04-30','2023-05-18','2023-04-30']})
Answers:
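I’m not sure whether this approach generalizes beyond the example given.
# The sample dates are strings, so convert them to datetimes first
df[['StartDate', 'EndDate']] = df[['StartDate', 'EndDate']].apply(pd.to_datetime)
end = pd.Series(pd.Index(df['StartDate']).union(df['EndDate']).unique())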
>>> end
0 2023-04-01
1 2023-04-08
2 2023-04-14
3 2023-04-15
4 2023-04-20
5 2023-04-30
6 2023-05-09
7 2023-05-18
Name: StartDate, dtype: datetime64[ns]
For the example given, subtracting 1 day from every date that also appears in StartDate produces the end date of each range (after duplicates are removed).
>>> end[end.isin(df['StartDate'])]
0 2023-04-01
1 2023-04-08
3 2023-04-15
4 2023-04-20
Name: StartDate, dtype: datetime64[ns]
For example:
first = end.iloc[0]
# Is this only true for the dates in this example?
end[end.isin(df['StartDate'])] -= pd.Timedelta(days=1)
end = end.drop_duplicates().tail(-1).reset_index(drop=True)
start = (end + pd.Timedelta(days=1)).shift().reset_index(drop=True)
start[0] = first
df_range = pd.DataFrame({"StartDate": start, "EndDate": end})
>>> df_range
StartDate EndDate
0 2023-04-01 2023-04-07
1 2023-04-08 2023-04-14
2 2023-04-15 2023-04-19
3 2023-04-20 2023-04-30
4 2023-05-01 2023-05-09
5 2023-05-10 2023-05-18
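On the question in the code comment: as far as I can tell, the subtract-one-day trick does depend on the data. It seems to break when the same date is both one company's EndDate and another company's StartDate, because that date is then only treated as a start. A minimal sketch of a more general construction, assuming EndDate is inclusive and the data has whole-day granularity (the names `split_ranges` and `df_range_general` are mine, not from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["A", "B", "C", "D", "E", "F", "G"],
    "StartDate": pd.to_datetime(["2023-04-01", "2023-04-01", "2023-04-01", "2023-04-01",
                                 "2023-04-08", "2023-04-15", "2023-04-20"]),
    "EndDate": pd.to_datetime(["2023-04-14", "2023-05-09", "2023-05-09", "2023-05-09",
                               "2023-04-30", "2023-05-18", "2023-04-30"]),
})

def split_ranges(df):
    """Return the non-overlapping sub-ranges implied by StartDate/EndDate."""
    one_day = pd.Timedelta(days=1)
    # A sub-range can begin on any StartDate or on the day after any EndDate.
    cuts = pd.concat([df["StartDate"], df["EndDate"] + one_day])
    cuts = cuts.drop_duplicates().sort_values().reset_index(drop=True)
    # Each sub-range runs from one cut to the day before the next cut.
    return pd.DataFrame({
        "StartDate": cuts.iloc[:-1].to_numpy(),
        "EndDate": (cuts.iloc[1:] - one_day).to_numpy(),
    })

df_range_general = split_ranges(df)
```

For the sample data this gives the same six ranges as df_range above. If the input has gaps where no company is active, this sketch also emits a range for each gap; an inner range join drops those rows because no company matches them.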
From there you can do a "range join"; there are many existing answers showing how to do that.
I find SQL offers a simple solution:
import duckdb
# used to keep the company name order
df = df.reset_index()
duckdb.sql("""
    from df t1, df_range t2
    select
        group_concat(Company order by index) Company,
        t2.*
    where
        t2.StartDate between t1.StartDate and t1.EndDate
    group by
        t2.StartDate, t2.EndDate
    order by
        t2.StartDate
""").df()
Company StartDate EndDate
0 A,B,C,D 2023-04-01 2023-04-07
1 A,B,C,D,E 2023-04-08 2023-04-14
2 B,C,D,E,F 2023-04-15 2023-04-19
3 B,C,D,E,F,G 2023-04-20 2023-04-30
4 B,C,D,F 2023-05-01 2023-05-09
5 F 2023-05-10 2023-05-18
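If you would rather stay in pandas, the same range join can be sketched with a cross merge followed by a filter. This is only one possible way to do it, and it assumes the frames are small enough that building every (company, range) pair is acceptable. The `_range` suffix and the names `pairs`, `inside` and `result` are my own choices:

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["A", "B", "C", "D", "E", "F", "G"],
    "StartDate": pd.to_datetime(["2023-04-01", "2023-04-01", "2023-04-01", "2023-04-01",
                                 "2023-04-08", "2023-04-15", "2023-04-20"]),
    "EndDate": pd.to_datetime(["2023-04-14", "2023-05-09", "2023-05-09", "2023-05-09",
                               "2023-04-30", "2023-05-18", "2023-04-30"]),
})
df_range = pd.DataFrame({
    "StartDate": pd.to_datetime(["2023-04-01", "2023-04-08", "2023-04-15",
                                 "2023-04-20", "2023-05-01", "2023-05-10"]),
    "EndDate": pd.to_datetime(["2023-04-07", "2023-04-14", "2023-04-19",
                               "2023-04-30", "2023-05-09", "2023-05-18"]),
})

# Pair every company row with every sub-range (needs pandas >= 1.2 for how="cross").
pairs = df.merge(df_range, how="cross", suffixes=("", "_range"))

# Keep the pairs whose sub-range starts inside the company's own date range,
# the same condition as the BETWEEN clause in the SQL version.
inside = pairs["StartDate_range"].between(pairs["StartDate"], pairs["EndDate"])

result = (
    pairs[inside]
    .groupby(["StartDate_range", "EndDate_range"])["Company"]
    .agg(",".join)  # rows keep df's order within each group
    .reset_index()
    .rename(columns={"StartDate_range": "StartDate", "EndDate_range": "EndDate"})
    [["Company", "StartDate", "EndDate"]]
)
```

The cross merge creates len(df) * len(df_range) rows before filtering, so for large inputs the DuckDB query or an interval-based approach is probably a better fit.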