Merge records that follow one another within group

Question:

I have the following dataframe:

   A  B start_date   end_date  id
0  1  2 2022-01-01 2022-01-10   1
1  2  2 2022-02-02 2022-02-05   2
2  1  2 2022-01-11 2022-01-15   3
3  2  2 2022-02-06 2022-02-10   4
4  2  2 2022-02-11 2022-02-15   5
5  2  3 2022-01-14 2022-01-17   6
6  2  3 2022-01-19 2022-01-22   7
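
To reproduce the example, the frame above can be rebuilt roughly as follows. This is only a sketch: it assumes the two date columns should be real datetimes (the printed frame does not show whether they are strings) and that the variable is named df.

```python
import pandas as pd

# Sample data copied from the printed frame; the datetime dtype is an assumption
df = pd.DataFrame({
    "A": [1, 2, 1, 2, 2, 2, 2],
    "B": [2, 2, 2, 2, 2, 3, 3],
    "start_date": pd.to_datetime([
        "2022-01-01", "2022-02-02", "2022-01-11", "2022-02-06",
        "2022-02-11", "2022-01-14", "2022-01-19",
    ]),
    "end_date": pd.to_datetime([
        "2022-01-10", "2022-02-05", "2022-01-15", "2022-02-10",
        "2022-02-15", "2022-01-17", "2022-01-22",
    ]),
    "id": range(1, 8),
})

print(df)
```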

Several records directly follow one another. For example, the records with id 1 and id 3 (rows 0 and 2): the second has the same values of A and B and starts the day after the first one ends. I want to compress this dataframe into the following form:

   A  B start_date   end_date  id
0  1  2 2022-01-01 2022-01-15   1
1  2  2 2022-02-02 2022-02-15   2
2  2  3 2022-01-14 2022-01-17   3
3  2  3 2022-01-19 2022-01-22   4

That is, I want to keep a single record for each sequence in which the start_date of every record is exactly 1 day after the end_date of the previous record with the same A and B. The merged record keeps the start_date of the first record in the sequence and takes the end_date of the last one.

Asked By: Fissium

||

Answers:

You can use a custom grouper to join the successive dates per group:

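import pandas as pd

# Make sure both date columns are datetimes so that date arithmetic works
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(pd.to_datetime)

# m numbers the runs of consecutive records within each (A, B) group.
# A row starts a new run unless its start_date is exactly one day after
# the previous end_date of the same group (the first row of a group is
# compared against NaT, so it always starts a run).
m = (df['start_date'].sub(df.groupby(['A', 'B'])
                        ['end_date'].shift()
                          .add(pd.Timedelta('1d'))
                        ).ne(pd.Timedelta(0))
      .groupby([df['A'], df['B']]).cumsum()
    )

# Collapse each run to its first start_date and last end_date, then renumber id
out = (df
 .groupby(['A', 'B', m], as_index=False)
 .agg({'start_date': 'first', 'end_date': 'last'})
 .assign(id=lambda d: range(1, len(d)+1))
)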

Output:

   A  B start_date   end_date  id
0  1  2 2022-01-01 2022-01-15   1
1  2  2 2022-02-02 2022-02-15   2
2  2  3 2022-01-14 2022-01-17   3
3  2  3 2022-01-19 2022-01-22   4
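
To see what the custom grouper does, the same idea can be broken into named steps and printed next to the data. This is a sketch rather than the answer's exact code: the names prev_end, gap, starts_new_run and run_id are made up for illustration, and the sample frame is rebuilt here so that the snippet runs on its own.

```python
import pandas as pd

# Sample data from the question, with the date columns parsed as datetimes
df = pd.DataFrame({
    "A": [1, 2, 1, 2, 2, 2, 2],
    "B": [2, 2, 2, 2, 2, 3, 3],
    "start_date": pd.to_datetime([
        "2022-01-01", "2022-02-02", "2022-01-11", "2022-02-06",
        "2022-02-11", "2022-01-14", "2022-01-19",
    ]),
    "end_date": pd.to_datetime([
        "2022-01-10", "2022-02-05", "2022-01-15", "2022-02-10",
        "2022-02-15", "2022-01-17", "2022-01-22",
    ]),
    "id": range(1, 8),
})

# end_date of the previous row within the same (A, B) group (NaT for the first row)
prev_end = df.groupby(["A", "B"])["end_date"].shift()

# Distance between start_date and "previous end_date + 1 day"; zero means consecutive
gap = df["start_date"] - (prev_end + pd.Timedelta(days=1))

# True where a new run begins; NaT compares as "not equal", so first rows are True
starts_new_run = gap.ne(pd.Timedelta(0))

# Running count of run starts per (A, B) group gives a run number for every row
run_id = starts_new_run.groupby([df["A"], df["B"]]).cumsum()

print(df.assign(starts_new_run=starts_new_run, run_id=run_id))
```

With the sample data, every row gets run number 1 except the last one, which starts a second run in group (2, 3) because 2022-01-19 is two days after 2022-01-17. Grouping by A, B and this run number is what merges the consecutive records.
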
Answered By: mozway
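
# Assumes start_date and end_date are already datetime columns
def function1(dd: pd.DataFrame):
    # Gap between each start_date and the previous end_date within one (A, B) group
    col1 = dd.start_date - dd.end_date.shift()
    # Any gap other than exactly 1 day (including NaT on the first row) starts a new run
    dd1 = dd.assign(col1=col1.ne(pd.Timedelta("1 days")).cumsum())
    return dd1.groupby("col1").agg(start_date=("start_date", "min"), end_date=("end_date", "max"))

out = df.groupby(["A", "B"]).apply(function1).reset_index().assign(id=lambda dd: dd.index + 1)

Output: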

   A  B  col1 start_date   end_date  id
0  1  2     1 2022-01-01 2022-01-15   1
1  2  2     1 2022-02-02 2022-02-15   2
2  2  3     1 2022-01-14 2022-01-17   3
3  2  3     2 2022-01-19 2022-01-22   4
Answered By: G.G