Group dataframe by the hour of the day in Pandas
Question:
This is my first time here.
My aim is to group the data by the hour of the day, sum the ‘flow’ column for the rows of each group, and divide the result by 60.
But I’m having some difficulty grouping my data by the hour of the day.
This is what my dataframe (over 150,000 rows) looks like:
https://i.stack.imgur.com/i51V2.png
I tried using this code:
import pandas as pd
import datetime as dt
df = pd.read_csv('staz_1.csv')
df['timestamp'] = pd.to_datetime(df['timestamp'])
group = df.groupby(df.timestamp.dt.hour)['flow'].sum()/60
But I obtained data grouped only by hour, without distinguishing the day, like this: https://i.stack.imgur.com/LBUZq.png
So my question: is it possible to group the data by each hour of each day, to get a representation like this?
timestamp flow
1 2020-03-30 06:00:00 708.0
2 2020-03-30 07:00:00 862.0
3 2020-03-30 08:00:00 858.0
4 2020-03-30 09:00:00 840.0
5 2020-03-30 10:00:00 835.0
...
Thanks in advance to anyone who replies.
Answers:
Use reset_index() on the grouped result to get a regular dataframe back:
df = df.groupby(df.timestamp.dt.hour)['flow'].sum().reset_index()
df['flow'] = df['flow']/60
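Note that grouping on dt.hour alone still pools the same clock hour from different days into one group. A minimal sketch on made-up data (the values below are invented for illustration, not taken from the asker's file), assuming the goal is one row per hour per day, floors each timestamp to the hour instead:

```python
import pandas as pd

# Made-up sample: the same clock hour (06:xx) appears on two different days.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-03-30 06:00", "2020-03-30 06:30",
        "2020-03-31 06:00", "2020-03-31 07:15",
    ]),
    "flow": [60, 120, 180, 240],
})

# Grouping on dt.hour alone pools 06:xx from both days into one group.
by_hour = df.groupby(df["timestamp"].dt.hour)["flow"].sum() / 60

# Flooring to the hour keeps the date, so each day gets its own rows.
hourly = (
    df.groupby(df["timestamp"].dt.floor("h"))["flow"].sum() / 60
).reset_index()
print(hourly)
```

Here by_hour has one row per clock hour, while hourly has one row per hour of each day, with timestamp and flow as ordinary columns.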
If I’m understanding your question correctly, it sounds to me like you have data from multiple hours and multiple dates and want each group to be a particular hour on a particular day? If that’s the case, then you’ll want to use two columns in your groupby. Try this:
import pandas as pd
import datetime as dt
df = pd.read_csv('staz_1.csv')
df['timestamp'] = pd.to_datetime(df['timestamp'])
group = df.groupby([df.timestamp.dt.date, df.timestamp.dt.hour])['flow'].sum()/60
Be aware that this will create a multi-index in the resulting grouped Series, which can be tricky to deal with. You can get rid of it by calling .reset_index() on group.
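As a rough sketch of that flattening step, on made-up data rather than the asker's file: the two grouping keys are renamed to date and hour (names chosen here for illustration), because both would otherwise be called timestamp, and duplicate level names can make reset_index() fail.

```python
import pandas as pd

# Made-up sample data spanning two days (not the asker's file).
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-03-30 06:10", "2020-03-30 06:50",
        "2020-03-30 07:05", "2020-03-31 06:20",
    ]),
    "flow": [30, 90, 60, 120],
})

# Give the two keys distinct names; both would otherwise be "timestamp".
date_key = df["timestamp"].dt.date.rename("date")
hour_key = df["timestamp"].dt.hour.rename("hour")

group = df.groupby([date_key, hour_key])["flow"].sum() / 60

# reset_index() turns the (date, hour) multi-index into ordinary columns.
flat = group.reset_index()
print(flat)
```

The result has three plain columns (date, hour, flow), which is usually easier to filter or plot than the multi-indexed Series.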
If you have columns in datetime format, you can use the pandas .resample() method to group the data more semantically.
You can group by any time unit, like days or hours, so you don’t have to remember more complex syntax like df.groupby([df.timestamp.dt.date, df.timestamp.dt.hour]). All you need is df.resample("H") (if your index is already a datetime index). Newer pandas releases (2.2 and later) deprecate the uppercase "H" and "T" aliases used in these examples in favor of "h" and "min".
Below is the case where your index is not a datetime type. You’ll need to first specify what you’re resampling on, which in this case is the timestamp column.
import pandas as pd
import numpy as np
# Time is a column; sample data is generated at one row per minute
df = pd.DataFrame({'timestamp': pd.date_range('2020-03-30', periods=300, freq='T'),
                   'flow': np.random.randint(60, 1000, 300)})
df
# timestamp flow
# 0 2020-03-30 00:00:00 488
# 1 2020-03-30 00:01:00 996
# 2 2020-03-30 00:02:00 437
# 3 2020-03-30 00:03:00 599
# 4 2020-03-30 00:04:00 405
# .. ... ...
# 295 2020-03-30 04:55:00 302
# 296 2020-03-30 04:56:00 425
# 297 2020-03-30 04:57:00 404
# 298 2020-03-30 04:58:00 987
# 299 2020-03-30 04:59:00 135
#
# [300 rows x 2 columns]
# Returns a dataframe
df.resample("H", on='timestamp').sum() / 60
# flow
# timestamp
# 2020-03-30 00:00:00 523.350000
# 2020-03-30 01:00:00 548.033333
# 2020-03-30 02:00:00 516.466667
# 2020-03-30 03:00:00 425.533333
# 2020-03-30 04:00:00 490.416667
Below is the case where the timestamp is already the index.
# Index is time
df_idx = pd.DataFrame({'flow': np.random.randint(60, 1000, 300)},
                      index=pd.date_range('2020-03-30', periods=300, freq='T'))
df_idx
# flow
# 2020-03-30 00:00:00 532
# 2020-03-30 00:01:00 341
# 2020-03-30 00:02:00 964
# 2020-03-30 00:03:00 885
# 2020-03-30 00:04:00 186
# ... ...
# 2020-03-30 04:55:00 996
# 2020-03-30 04:56:00 946
# 2020-03-30 04:57:00 510
# 2020-03-30 04:58:00 564
# 2020-03-30 04:59:00 918
#
# [300 rows x 1 columns]
# Returns a series
df_idx['flow'].resample('H').sum() / 60
# 2020-03-30 00:00:00 569.516667
# 2020-03-30 01:00:00 548.050000
# 2020-03-30 02:00:00 505.283333
# 2020-03-30 03:00:00 530.566667
# 2020-03-30 04:00:00 522.383333
# Freq: H, Name: flow, dtype: float64
The pandas documentation page on the .resample() method is quite useful as well: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html#pandas.DataFrame.resample
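One difference from groupby worth keeping in mind: resample() emits a row for every hour between the first and last timestamp, including hours with no data. The sketch below uses made-up data with a gap and assumes you want those empty hours dropped and the result as timestamp/flow columns, as in the question.

```python
import pandas as pd

# Made-up sample with a gap: nothing is recorded between 07:00 and 09:00.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-03-30 06:10", "2020-03-30 06:50", "2020-03-30 09:30",
    ]),
    "flow": [30, 90, 60],
})

# min_count=1 makes empty hourly bins NaN instead of 0, so they can be dropped.
hourly = (
    df.resample("h", on="timestamp")["flow"]
      .sum(min_count=1)
      .dropna()
      .div(60)
      .reset_index()
)
print(hourly)
```

If a zero for empty hours is what you want, leave out min_count=1 and dropna().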