Dataframe groupby to get stints
Question:
I'm trying to write code that filters/edits a dataframe to obtain "stints" for each individual. Using the dataframe below as an example, I'm basically trying to get each person's start and end dates for a given location. Usually I can get started on my own, but I'm stumped as to how to approach this, so if anyone has ideas I would greatly appreciate it.
| | Person | Location | Date |
|---|---|---|---|
| 0 | Tom | A | 1/1/2021 |
| 1 | Tom | A | 1/2/2021 |
| 2 | Tom | A | 1/3/2021 |
| 3 | Tom | B | 1/4/2021 |
| 4 | Tom | B | 1/5/2021 |
| 5 | Tom | B | 1/6/2021 |
| 6 | Tom | A | 1/7/2021 |
| 7 | Tom | A | 1/8/2021 |
| 8 | Tom | A | 1/9/2021 |
| 9 | Tom | C | 1/10/2021 |
| 10 | Tom | C | 1/11/2021 |
| 11 | Tom | A | 1/12/2021 |
| 12 | Tom | A | 1/13/2021 |
| 13 | Tom | B | 1/14/2021 |
| 14 | Tom | B | 1/15/2021 |
| 15 | Mark | A | 1/1/2021 |
| 16 | Mark | A | 1/2/2021 |
| 17 | Mark | B | 1/3/2021 |
| 18 | Mark | B | 1/4/2021 |
| 19 | Mark | A | 1/5/2021 |
| 20 | Mark | A | 1/6/2021 |
| 21 | Mark | C | 1/7/2021 |
| 22 | Mark | C | 1/8/2021 |
| 23 | Mark | C | 1/9/2021 |
| 24 | Mark | C | 1/10/2021 |
| 25 | Mark | A | 1/11/2021 |
| 26 | Mark | A | 1/12/2021 |
| 27 | Mark | B | 1/13/2021 |
| 28 | Mark | B | 1/14/2021 |
| 29 | Mark | B | 1/15/2021 |
Expected outcome:
| | Person | Location | StintNum | Start_Date | End_Date |
|---|---|---|---|---|---|
| 0 | Tom | A | 1 | 1/1/2021 | 1/3/2021 |
| 1 | Tom | B | 2 | 1/4/2021 | 1/6/2021 |
| 2 | Tom | A | 3 | 1/7/2021 | 1/9/2021 |
| 3 | Tom | C | 4 | 1/10/2021 | 1/11/2021 |
| 4 | Tom | A | 5 | 1/12/2021 | 1/13/2021 |
| 5 | Tom | B | 6 | 1/14/2021 | 1/15/2021 |
| 6 | Mark | A | 1 | 1/1/2021 | 1/2/2021 |
| 7 | Mark | B | 2 | 1/3/2021 | 1/4/2021 |
| 8 | Mark | A | 3 | 1/5/2021 | 1/6/2021 |
| 9 | Mark | C | 4 | 1/7/2021 | 1/10/2021 |
| 10 | Mark | A | 5 | 1/11/2021 | 1/12/2021 |
| 11 | Mark | B | 6 | 1/13/2021 | 1/15/2021 |
Answers:
Try this:
```python
df['Date'] = pd.to_datetime(df['Date'])

# A new stint starts whenever Location differs from the previous row
stint = df['Location'].ne(df['Location'].shift()).cumsum().rename('stint')
new_df = (
    df.groupby(['Person', stint], sort=False)
      .apply(lambda x: pd.Series(
          [x['Location'].iloc[0], x['Date'].min(), x['Date'].max()],
          index=['Location', 'Start_Date', 'End_Date']))
      .reset_index()
      .drop(columns='stint')
)
new_df['StintNum'] = new_df.groupby('Person').cumcount().add(1)
```

Output:

```
>>> new_df
   Person Location Start_Date   End_Date  StintNum
0     Tom        A 2021-01-01 2021-01-03         1
1     Tom        B 2021-01-04 2021-01-06         2
2     Tom        A 2021-01-07 2021-01-09         3
3     Tom        C 2021-01-10 2021-01-11         4
4     Tom        A 2021-01-12 2021-01-13         5
5     Tom        B 2021-01-14 2021-01-15         6
6    Mark        A 2021-01-01 2021-01-02         1
7    Mark        B 2021-01-03 2021-01-04         2
8    Mark        A 2021-01-05 2021-01-06         3
9    Mark        C 2021-01-07 2021-01-10         4
10   Mark        A 2021-01-11 2021-01-12         5
11   Mark        B 2021-01-13 2021-01-15         6
```
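Both answers on this page hinge on the same grouping key. To see what it does, here is a minimal, standalone illustration of the `ne`/`shift`/`cumsum` trick (the toy series is made up for demonstration):

```python
import pandas as pd

loc = pd.Series(['A', 'A', 'B', 'B', 'A'])
# Compare each value with the previous row: True marks the start of a new run
change = loc.ne(loc.shift())
# Cumulative sum of those booleans gives every consecutive run its own id
block = change.cumsum()
print(change.tolist())  # [True, False, True, False, True]
print(block.tolist())   # [1, 1, 2, 2, 3]
```

Grouping by this id (together with `Person`) therefore groups each unbroken run of identical locations, which is exactly a "stint".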
IMO, a clean way is to use `groupby` + `agg`; this makes it easy to set custom aggregators and is faster than `apply`:
```python
df['Date'] = pd.to_datetime(df['Date'])

# Renamed to 'stint' so the grouper doesn't collide with the 'Location' output
group = df['Location'].ne(df['Location'].shift()).cumsum().rename('stint')
df2 = (
    df.groupby(['Person', group], as_index=False)
      .agg(Location=('Location', 'first'),
           # the line below is a dummy aggregator used as a column placeholder;
           # uncomment it if you want the columns in this exact order
           #StintNum=('Location', lambda x: float('NaN')),
           Start_Date=('Date', 'min'),
           End_Date=('Date', 'max'),
      )
      .drop(columns='stint')
)
df2['StintNum'] = df2.groupby('Person').cumcount().add(1)
```

Output:

```
   Person Location  StintNum Start_Date   End_Date
0    Mark        A         1 2021-01-01 2021-01-02
1    Mark        B         2 2021-01-03 2021-01-04
2    Mark        A         3 2021-01-05 2021-01-06
3    Mark        C         4 2021-01-07 2021-01-10
4    Mark        A         5 2021-01-11 2021-01-12
5    Mark        B         6 2021-01-13 2021-01-15
6     Tom        A         1 2021-01-01 2021-01-03
7     Tom        B         2 2021-01-04 2021-01-06
8     Tom        A         3 2021-01-07 2021-01-09
9     Tom        C         4 2021-01-10 2021-01-11
10    Tom        A         5 2021-01-12 2021-01-13
11    Tom        B         6 2021-01-14 2021-01-15
```
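As an aside, the commented-out NaN placeholder above isn't strictly needed: the columns can simply be reindexed after `StintNum` is added. A minimal sketch (the one-row frame stands in for the `df2` produced above):

```python
import pandas as pd

# Stand-in for df2 after the StintNum column was appended last
df2 = pd.DataFrame({
    'Person': ['Tom'], 'Location': ['A'],
    'Start_Date': [pd.Timestamp('2021-01-01')],
    'End_Date': [pd.Timestamp('2021-01-03')],
    'StintNum': [1],
})
# Select the columns in the desired order instead of aggregating a dummy
df2 = df2[['Person', 'Location', 'StintNum', 'Start_Date', 'End_Date']]
print(list(df2.columns))
```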
```python
def function1(dd: pd.DataFrame):
    return dd.groupby(["Person", "Location"], as_index=False).agg(
        StintNum=("col1", "first"),
        Start_Date=("Date", "min"),
        End_Date=("Date", "max"),
    )

# Convert first so min/max compare dates, not strings
# (as raw strings, "1/10/2021" sorts before "1/7/2021")
df1["Date"] = pd.to_datetime(df1["Date"])
col1 = df1.groupby("Person", group_keys=False).apply(
    lambda dd: dd.Location.ne(dd.Location.shift()).cumsum()
)
df1.assign(col1=col1).groupby(["Person", "col1"], sort=False).apply(
    function1
).reset_index(drop=True)
```

output:

```
   Person Location  StintNum Start_Date   End_Date
0     Tom        A         1 2021-01-01 2021-01-03
1     Tom        B         2 2021-01-04 2021-01-06
2     Tom        A         3 2021-01-07 2021-01-09
3     Tom        C         4 2021-01-10 2021-01-11
4     Tom        A         5 2021-01-12 2021-01-13
5     Tom        B         6 2021-01-14 2021-01-15
6    Mark        A         1 2021-01-01 2021-01-02
7    Mark        B         2 2021-01-03 2021-01-04
8    Mark        A         3 2021-01-05 2021-01-06
9    Mark        C         4 2021-01-07 2021-01-10
10   Mark        A         5 2021-01-11 2021-01-12
11   Mark        B         6 2021-01-13 2021-01-15
```
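Putting the pieces together, here is a self-contained sketch that computes the run id per person with `transform` and aggregates in a single `agg` call; it reproduces only the first few rows of the question's data, and the column names follow the question:

```python
import pandas as pd

# Abbreviated sample in the question's shape
df = pd.DataFrame({
    'Person': ['Tom'] * 6 + ['Mark'] * 4,
    'Location': ['A', 'A', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'A'],
    'Date': pd.to_datetime(
        ['1/1/2021', '1/2/2021', '1/3/2021', '1/4/2021', '1/5/2021',
         '1/6/2021', '1/1/2021', '1/2/2021', '1/3/2021', '1/4/2021']),
})

# Per-person run id: a new stint whenever Location changes within a person
stint = (df.groupby('Person')['Location']
           .transform(lambda s: s.ne(s.shift()).cumsum()))
out = (df.groupby(['Person', stint.rename('StintNum')], sort=False)
         .agg(Location=('Location', 'first'),
              Start_Date=('Date', 'min'),
              End_Date=('Date', 'max'))
         .reset_index())
print(out)
```

Because the cumsum is computed per person here, `StintNum` restarts at 1 for each person directly, so no separate `cumcount` step is needed.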