Pandas cumsum on a groupby object with a condition

Question:

For the given dataset below, I am able to calculate "distinct" visits to a hotel, which is defined as any visit where the traveler arrives at least one full calendar day (i.e., 2 or more days) since a previous departure. Any arrival that is less than one full calendar day since the previous visit will be merged with the previous visit. The negative days_between_visits are simulating some data errors that are found throughout the dataset.

I get the new start date by running: df["NewStartDate"] = df.groupby((df["days_between_visits"]>=2).cumsum().shift().fillna(0))["arrival_date "].transform("min")

and get the new end date by running: df["NewEndDate"] = df.groupby((df["days_between_visits"]>=2).cumsum().shift().fillna(0))["departure_date "].transform("max")

Check here for additional info on NewStartDate and NewEndDate were calculated.

ID	arrival_date	departure_date	days_between_visits
1	2010-04-01	2010-04-02	0
1	2010-04-02	2010-04-02	1
1	2010-04-03	2010-04-03	1
1	2010-04-04	2010-04-04	1
1	2010-04-05	2010-04-05	15
1	2010-04-20	2010-04-21	-1
1	2010-04-20	2010-04-22	-2
1	2010-04-20	2010-04-27	39
1	2010-06-05	2010-06-05	1
1	2010-06-06	2010-06-06	2
1	2010-06-08	2010-09-22	375
1	2011-10-01	2011-10-02	0
1	2011-10-02	2011-10-04	1
1	2011-10-05	2011-10-06	87
1	2012-01-01	2012-01-02

The problem: I don’t understand what is happening on the conditional groupby with the .cumsum(). The conditional groupby (on its own) returns two groups (i.e., those with <2 days between visits (i.e., False group) and those with >=2 days between visits (i.e., True group). When adding the .cumsum() to the process, we get a cumulative sum on the days_between_visits within each group BUT only after new groups are created, which are split where the condition in the groupby is True (i.e., each row where days_between_visits >= 2 is the start of a new group). Can somebody explain what is happening here and why the .cumsum() is resulting in more groups?

Final result:

ID	arrival_date	departure_date	days_between_visits	NewStartDate	NewEndDate
1	2010-04-01	2010-04-02	0	2010-04-01	2010-04-05
1	2010-04-02	2010-04-02	1	2010-04-01	2010-04-05
1	2010-04-03	2010-04-03	1	2010-04-01	2010-04-05
1	2010-04-04	2010-04-04	1	2010-04-01	2010-04-05
1	2010-04-05	2010-04-05	15	2010-04-01	2010-04-05
1	2010-04-20	2010-04-21	-1	2010-04-20	2010-04-27
1	2010-04-20	2010-04-22	-2	2010-04-20	2010-04-27
1	2010-04-20	2010-04-27	39	2010-04-20	2010-04-27
1	2010-06-05	2010-06-05	1	2010-06-05	2010-06-06
1	2010-06-06	2010-06-06	2	2010-06-05	2010-06-06
1	2010-06-08	2010-09-22	375	2010-06-08	2010-09-22
1	2011-10-01	2011-10-02	0	2011-10-01	2011-10-06
1	2011-10-02	2011-10-04	1	2011-10-01	2011-10-06
1	2011-10-05	2011-10-06	87	2011-10-01	2011-10-06
1	2012-01-01	2012-01-02		2012-01-01	2012-01-02

Asked By: adm-gis

Source

Answers:

Let’s do this piece by piece: (df["days_between_visits"]>=2) returns a boolean series. We treat this series as an int64 because pandas allows for some integer operations to be performed on boolean series (with False == 0 and True == 1).

When we perform the .cumsum() on the boolean series, we get the following:

However, because the True values are indications of the last row for a given visit to the hotel (i.e., >=2 days until the next visit visit), we need to include the row that contains the True (i.e., 1) value into the visit by shifting (i.e., shift()) the data down one row producing the following:

Now we can fillna(0) and we have our distinct groups. We can now groupby this newly created series and perform our transform functions on the arrival_date and departure_date as needed.

Answered By: adm-gis