Group by first occurrence of each value in a pandas dataframe
Question:
I have a pandas dataframe that looks like this:
id
user
action
timestamp
1
Jim
start
12/10/2022
2
Jim
start
12/10/2022
3
Jim
end
2/2/2022
4
Linette
start
8/18/2022
5
Linette
start
3/24/2022
6
Linette
end
8/27/2022
7
Rachel
start
2/7/2022
8
Rachel
end
1/4/2023
9
James
start
6/12/2022
10
James
end
5/14/2022
11
James
start
11/28/2022
12
James
start
8/9/2022
13
James
end
2/15/2022
For each user, there can be more than one start event, but only one end. Imagine that they sometimes need to start a book over again, but only finish it once.
What I want is to calculate the time difference between the first start and the end, so keep, for each user, the first occurrence of "start" and "end" in each group.
Any hint?
Answers:
>>> (df.groupby(["user", "action"], sort=False)["timestamp"]
.first()
.droplevel("action")
.diff().iloc[1::2])
user
James 29 days
Jim 311 days
Linette -9 days
Rachel -331 days
Name: timestamp, dtype: timedelta64[ns]
- for "timestamp" of each "user" & "action" pair, get the first occurences
- this will effectively take the first start, and the (only) end
- then drop the carried over "action" level of groupers
- take the difference from ends and starts
- take every other value to get per-user difference
(sort=False ensures during groupby that start’s don’t get mixed up.)
I have a pandas dataframe that looks like this:
id | user | action | timestamp |
---|---|---|---|
1 | Jim | start | 12/10/2022 |
2 | Jim | start | 12/10/2022 |
3 | Jim | end | 2/2/2022 |
4 | Linette | start | 8/18/2022 |
5 | Linette | start | 3/24/2022 |
6 | Linette | end | 8/27/2022 |
7 | Rachel | start | 2/7/2022 |
8 | Rachel | end | 1/4/2023 |
9 | James | start | 6/12/2022 |
10 | James | end | 5/14/2022 |
11 | James | start | 11/28/2022 |
12 | James | start | 8/9/2022 |
13 | James | end | 2/15/2022 |
For each user, there can be more than one start event, but only one end. Imagine that they sometimes need to start a book over again, but only finish it once.
What I want is to calculate the time difference between the first start and the end, so keep, for each user, the first occurrence of "start" and "end" in each group.
Any hint?
>>> (df.groupby(["user", "action"], sort=False)["timestamp"]
.first()
.droplevel("action")
.diff().iloc[1::2])
user
James 29 days
Jim 311 days
Linette -9 days
Rachel -331 days
Name: timestamp, dtype: timedelta64[ns]
- for "timestamp" of each "user" & "action" pair, get the first occurences
- this will effectively take the first start, and the (only) end
- then drop the carried over "action" level of groupers
- take the difference from ends and starts
- take every other value to get per-user difference
(sort=False ensures during groupby that start’s don’t get mixed up.)