PANDAS group by with 30 minute intervals and calculate total difference
Question:
I have a data frame that looks like this:
date | week | id |
---|---|---|
20/07/21 12:46:00 | 1 | d1 |
20/07/21 12:56:00 | 1 | d1 |
20/07/21 13:09:00 | 1 | d1 |
20/07/21 14:11:00 | 1 | d1 |
20/07/21 14:42:00 | 1 | d1 |
I want to group by date in 30-minute intervals: if two consecutive rows are more than 30 minutes apart, they belong to different groups.
The output I need looks like this:
week | id | min_date | max_date |
---|---|---|---|
1 | d1 | 20/07/21 12:46:00 | 20/07/21 13:09:00 |
1 | d1 | 20/07/21 14:11:00 | 20/07/21 14:11:00 |
1 | d1 | 20/07/21 14:42:00 | 20/07/21 14:42:00 |
I used this code in order to group by:
x = df.groupby(['id', 'week', pd.Grouper(key='date', freq='30min', origin='start')]).agg({'date': [np.min, np.max]})
Something isn’t working with the grouper; any suggestions on how to improve it?
EDIT:
Here’s an example of my data that causes an issue:
date | week | id |
---|---|---|
20/07/21 12:46:00 | 1 | d1 |
20/07/21 12:56:00 | 1 | d1 |
20/07/21 13:09:00 | 1 | d1 |
22/07/21 07:11:00 | 1 | d1 |
22/07/21 07:14:00 | 1 | d1 |
22/07/21 07:27:00 | 1 | d1 |
22/07/21 08:34:00 | 1 | d1 |
22/07/21 08:36:00 | 1 | d1 |
The output required is:
week | id | min_date | max_date |
---|---|---|---|
1 | d1 | 20/07/21 12:46:00 | 20/07/21 13:09:00 |
1 | d1 | 22/07/21 07:11:00 | 22/07/21 07:27:00 |
1 | d1 | 22/07/21 08:34:00 | 22/07/21 08:36:00 |
This is the output I get:
week | id | min_date | max_date |
---|---|---|---|
1 | d1 | 20/07/21 12:46:00 | 20/07/21 13:09:00 |
1 | d1 | 22/07/21 07:11:00 | 22/07/21 08:36:00 |
I don’t understand why it groups the last rows together when there is more than an hour between 22/07/21 07:27:00 and 22/07/21 08:34:00.
Thanks!
Answers:
You can use:
df['date'] = pd.to_datetime(df['date'])
(df.groupby(df['date'].diff().gt(pd.Timedelta('30min')).cumsum())
['date'].agg(['min', 'max'])
)
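To see what this does, the `diff`/`gt`/`cumsum` chain labels each row with a group number that increments every time the gap to the previous row exceeds 30 minutes. A minimal sketch on the question's first sample (the intermediate series is shown for illustration):

```python
import pandas as pd

# Sample from the question: gaps larger than 30 minutes occur
# between 13:09 and 14:11, and between 14:11 and 14:42.
df = pd.DataFrame({'date': pd.to_datetime([
    '2021-07-20 12:46:00',
    '2021-07-20 12:56:00',
    '2021-07-20 13:09:00',
    '2021-07-20 14:11:00',
    '2021-07-20 14:42:00',
])})

# diff() gives the gap to the previous row (NaT for the first row),
# gt() flags gaps over 30 minutes (NaT compares as False), and
# cumsum() turns the flags into a running group id.
group_id = df['date'].diff().gt(pd.Timedelta('30min')).cumsum()
print(group_id.tolist())  # [0, 0, 0, 1, 2]

out = df.groupby(group_id)['date'].agg(['min', 'max'])
print(out)
```

Each group id then collapses to one output row via `agg(['min', 'max'])`.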
Or maybe also group by id and week:
df['date'] = pd.to_datetime(df['date'])
(df.groupby(['week', 'id', df['date'].diff().gt(pd.Timedelta('30min')).cumsum()])
['date'].agg(['min', 'max'])
.droplevel(-1).reset_index()
)
Output:
week id min max
0 1 d1 2021-07-20 12:46:00 2021-07-20 13:09:00
1 1 d1 2021-07-20 14:11:00 2021-07-20 14:11:00
2 1 d1 2021-07-20 14:42:00 2021-07-20 14:42:00
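As for why the original `pd.Grouper` attempt misbehaves: `freq='30min'` slices the timeline into fixed half-hour bins and assigns each row to whichever bin contains it; it never looks at the gap between consecutive rows. A minimal sketch with illustrative data (not the asker's frame) showing two rows only two minutes apart landing in different bins:

```python
import pandas as pd

# Two timestamps 2 minutes apart, straddling a half-hour boundary.
df = pd.DataFrame({'date': pd.to_datetime([
    '2021-07-20 12:59:00',
    '2021-07-20 13:01:00',
])})

# With the default origin, 30-minute bins start on the half hour:
# 12:30-13:00 catches 12:59, 13:00-13:30 catches 13:01.
binned = df.groupby(pd.Grouper(key='date', freq='30min'))['date'].count()
print(binned)  # two bins of one row each
```

The converse also happens: rows well over 30 minutes apart can share a bin's neighbourhood once several bins are merged across keys, which is why the gap-based `diff`/`cumsum` approach matches the required output and fixed-frequency binning does not.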
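One further note, relevant only if rows of different ids can be interleaved in the frame: a plain `df['date'].diff()` compares each row with the previous row of the whole frame, even when that row belongs to another id. Computing the gaps per `(week, id)` group avoids that. A sketch with hypothetical interleaved data (`d2` here is illustrative):

```python
import pandas as pd

# Two ids; d2's first row is only 1 minute after d1's last row, but the
# gap should be measured within each id, not across ids.
df = pd.DataFrame({
    'week': [1, 1, 1, 1],
    'id':   ['d1', 'd1', 'd2', 'd2'],
    'date': pd.to_datetime([
        '2021-07-20 12:46:00', '2021-07-20 12:56:00',
        '2021-07-20 12:57:00', '2021-07-20 13:40:00',
    ]),
})

# diff() within each (week, id) group: the first row of every group
# gets NaT instead of a gap to another id's last row.
gaps = df.groupby(['week', 'id'])['date'].diff()
grp = gaps.gt(pd.Timedelta('30min')).cumsum()
out = (df.groupby(['week', 'id', grp])['date']
         .agg(['min', 'max'])
         .droplevel(-1).reset_index())
print(out)
```

Here `d2` splits into two sessions (12:57 alone, then 13:40 alone) because its internal gap is 43 minutes.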