How to split overlapping date ranges into multiple date ranges in pandas?

Question:

I have the following pandas DataFrame (sample data is given under EDIT below).

I want to find the overlapping date ranges and aggregate the companies that share each date range. My desired output has one row per non-overlapping date range, listing the companies in that range.

How can I achieve this in pandas? I tried a few approaches but could not cover all scenarios. I would really appreciate any help with this. Thanks.

EDIT: Sample Data

import pandas as pd

df = pd.DataFrame(data={
    'Company': ['A','B','C','D','E','F','G'],
    'StartDate': ['2023-04-01','2023-04-01','2023-04-01','2023-04-01','2023-04-08','2023-04-15','2023-04-20'],
    'EndDate': ['2023-04-14','2023-05-09','2023-05-09','2023-05-09','2023-04-30','2023-05-18','2023-04-30']})
Asked By: PyPyVk

Answers:

I’m not sure whether this approach works beyond the example given. First, take the sorted union of all start and end dates:

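df[['StartDate', 'EndDate']] = df[['StartDate', 'EndDate']].apply(pd.to_datetime)  # sample data holds strings
end = pd.Series(pd.Index(df['StartDate']).union(df['EndDate']).unique())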
>>> end
0   2023-04-01
1   2023-04-08
2   2023-04-14
3   2023-04-15
4   2023-04-20
5   2023-04-30
6   2023-05-09
7   2023-05-18
Name: StartDate, dtype: datetime64[ns]

For the example given, subtracting one day from every boundary date that also appears in StartDate yields the range end dates (once duplicates are removed). These are the boundary dates that also appear in StartDate:

>>> end[end.isin(df['StartDate'])] 
0   2023-04-01
1   2023-04-08
3   2023-04-15
4   2023-04-20
Name: StartDate, dtype: datetime64[ns]

For example:

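# remember the earliest date; it becomes the first range start
first = end.iloc[0]

# Boundaries that are also start dates become range ends once shifted back a day.
# Caveat: this may only hold for the dates in this example.
end[end.isin(df['StartDate'])] -= pd.Timedelta(days=1)

# drop duplicate boundaries and the first entry (the day before the earliest start)
end = end.drop_duplicates().tail(-1).reset_index(drop=True)

# each range starts the day after the previous range ends
start = (end + pd.Timedelta(days=1)).shift().reset_index(drop=True)
start[0] = first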

df_range = pd.DataFrame({"StartDate": start, "EndDate": end})
>>> df_range
   StartDate    EndDate
0 2023-04-01 2023-04-07
1 2023-04-08 2023-04-14
2 2023-04-15 2023-04-19
3 2023-04-20 2023-04-30
4 2023-05-01 2023-05-09
5 2023-05-10 2023-05-18
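
The subtraction shortcut above depends on how the example dates line up. For instance, if one company's EndDate equalled another company's StartDate, the shared day would not get its own range. A more general sketch, assuming inclusive ranges at day granularity: a new segment begins at every StartDate and on the day after every EndDate, so the sorted set of those dates gives every segment start. The function name `split_boundaries` is hypothetical and not part of pandas.

```python
import pandas as pd

def split_boundaries(df: pd.DataFrame) -> pd.DataFrame:
    """Build non-overlapping, inclusive date ranges (hypothetical helper)."""
    one_day = pd.Timedelta(days=1)
    starts = pd.to_datetime(df["StartDate"])
    ends = pd.to_datetime(df["EndDate"])
    # Cut points: every start date, plus the day after every end date.
    cuts = (
        pd.concat([starts, ends + one_day])
        .drop_duplicates()
        .sort_values(ignore_index=True)
    )
    # Consecutive cut points delimit one range; the end is the day before the next cut.
    return pd.DataFrame({
        "StartDate": cuts.iloc[:-1].to_numpy(),
        "EndDate": (cuts.iloc[1:] - one_day).to_numpy(),
    })

df = pd.DataFrame({
    "Company": ["A", "B", "C", "D", "E", "F", "G"],
    "StartDate": ["2023-04-01", "2023-04-01", "2023-04-01", "2023-04-01",
                  "2023-04-08", "2023-04-15", "2023-04-20"],
    "EndDate": ["2023-04-14", "2023-05-09", "2023-05-09", "2023-05-09",
                "2023-04-30", "2023-05-18", "2023-04-30"],
})
ranges = split_boundaries(df)
print(ranges)
```

On the sample data this should reproduce the six ranges shown in df_range. If the input has gaps in which no company is active, the sketch also emits a range for each gap; those rows have no matching company and drop out of an inner range join.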

From there you can do a "range join"; there are many existing answers showing how to do that.
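
One pandas-only way to sketch such a range join is a cross merge followed by a filter. This is an illustration, not necessarily the approach those answers take. It assumes pandas 1.2 or later (for `how="cross"`), datetime columns, and inputs small enough that pairing every company row with every range fits in memory. The names `pairs`, `active`, and `result` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["A", "B", "C", "D", "E", "F", "G"],
    "StartDate": pd.to_datetime(["2023-04-01", "2023-04-01", "2023-04-01", "2023-04-01",
                                 "2023-04-08", "2023-04-15", "2023-04-20"]),
    "EndDate": pd.to_datetime(["2023-04-14", "2023-05-09", "2023-05-09", "2023-05-09",
                               "2023-04-30", "2023-05-18", "2023-04-30"]),
})
df_range = pd.DataFrame({
    "StartDate": pd.to_datetime(["2023-04-01", "2023-04-08", "2023-04-15",
                                 "2023-04-20", "2023-05-01", "2023-05-10"]),
    "EndDate": pd.to_datetime(["2023-04-07", "2023-04-14", "2023-04-19",
                               "2023-04-30", "2023-05-09", "2023-05-18"]),
})

# Pair every company row with every range; the company's own dates get a suffix.
pairs = df.merge(df_range, how="cross", suffixes=("_company", ""))

# Keep pairs where the range start falls inside the company's inclusive span.
active = pairs[pairs["StartDate"].between(pairs["StartDate_company"],
                                          pairs["EndDate_company"])]

# Collect the company names per range, preserving the original row order.
result = (
    active.groupby(["StartDate", "EndDate"])["Company"]
    .agg(",".join)
    .reset_index()
)
print(result)
```

The filter mirrors a condition of the form "range start between company start and company end", so it only gives correct results when no range straddles a company's start or end date.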

I find that SQL offers a simple solution (here via DuckDB, which can query the pandas DataFrames by name):

import duckdb

# add the row position as an `index` column so company names keep their original order
df = df.reset_index()

duckdb.sql("""
   from df t1, df_range t2
   select 
      group_concat(Company order by index) Company,
      t2.*
   where
      t2.StartDate between t1.StartDate and t1.EndDate
   group by 
      t2.StartDate, t2.EndDate
   order by
      t2.StartDate
""").df()
       Company  StartDate    EndDate
0      A,B,C,D 2023-04-01 2023-04-07
1    A,B,C,D,E 2023-04-08 2023-04-14
2    B,C,D,E,F 2023-04-15 2023-04-19
3  B,C,D,E,F,G 2023-04-20 2023-04-30
4      B,C,D,F 2023-05-01 2023-05-09
5            F 2023-05-10 2023-05-18
Answered By: jqurious