Group dataframe by the hour of the day in Pandas

Question:

This is my first time here.
My aim is to group the data by the hour of the day, sum the ‘flow’ column for the rows of each group, and divide the result by 60.
But I’m having some difficulty grouping my data by the hour of the day.

This is what my dataframe (over 150,000 rows) looks like:
https://i.stack.imgur.com/i51V2.png

I tried by using this code:

import pandas as pd
import datetime as dt

df = pd.read_csv('staz_1.csv')
df['timestamp'] = pd.to_datetime(df['timestamp'])
group = df.groupby(df.timestamp.dt.hour)['flow'].sum()/60 

But I obtained data grouped only by hour, without distinguishing between days, like this: https://i.stack.imgur.com/LBUZq.png

So my question: is it possible to group the data by each hour of each day, to get a representation like this?

   timestamp               flow
1  2020-03-30 06:00:00     708.0
2  2020-03-30 07:00:00     862.0 
3  2020-03-30 08:00:00     858.0
4  2020-03-30 09:00:00     840.0
5  2020-03-30 10:00:00     835.0
...

Thanks in advance to anyone who replies.

Asked By: Alessio Vacca

||

Answers:

Use .reset_index() to turn the grouped result back into a dataframe:

df = df.groupby(df.timestamp.dt.hour)['flow'].sum().reset_index()
df['flow'] = df['flow']/60
Answered By: deadshot

If I’m understanding your question correctly, it sounds to me like you have data from multiple hours and multiple dates and want each group to be a particular hour on a particular day? If that’s the case, then you’ll want to use two columns in your groupby. Try this:

import pandas as pd

df = pd.read_csv('staz_1.csv')
df['timestamp'] = pd.to_datetime(df['timestamp'])
group = df.groupby([df.timestamp.dt.date, df.timestamp.dt.hour])['flow'].sum()/60 

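Be aware that this will create a multi-index on the resulting grouped series, which can be tricky to deal with. You can get rid of that by using .reset_index() on group, but because both index levels are named timestamp here, you need to rename them first, for example with group.rename_axis(['date', 'hour']).reset_index().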
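As a minimal sketch of the above (the generated data is only a hypothetical stand-in for staz_1.csv, and the level names 'date' and 'hour' are my own choice), this shows the two-key groupby flattened into ordinary columns. It also shows an alternative that floors each timestamp to the hour, which is not part of the approach above but keeps a single timestamp key:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for staz_1.csv: one reading per minute over two days,
# with a constant flow so the hourly sums are easy to check by hand.
n = 2 * 24 * 60
df = pd.DataFrame({
    'timestamp': pd.date_range('2020-03-30', periods=n, freq='min'),
    'flow': np.full(n, 10.0),
})

# Two-key groupby: one group per (date, hour) pair.
group = df.groupby([df['timestamp'].dt.date, df['timestamp'].dt.hour])['flow'].sum() / 60

# Both index levels inherit the name 'timestamp', so rename them
# before flattening the multi-index into columns.
flat = group.rename_axis(['date', 'hour']).reset_index()

# Alternative (assumption: a single key is acceptable): floor each timestamp
# to the start of its hour, so the key stays a full timestamp.
hourly = (df.groupby(df['timestamp'].dt.floor('h'))['flow'].sum() / 60).reset_index()

print(flat.head())
print(hourly.head())
```

The floored version gives one row per hour of each day with a full timestamp in the first column, which is close to the layout asked for in the question.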

Answered By: ben-fogarty

If your timestamps are stored as a datetime type, you can use the pandas .resample() method to group the data more semantically.

You can group by any time unit, like days or hours, so you don’t have to remember more complex syntax like df.groupby([df.timestamp.dt.date, df.timestamp.dt.hour]). All you need is df.resample("H") (if your index is already a datetime index). Note that pandas 2.2 and later deprecate the "H" and "T" aliases used in this answer in favor of "h" and "min".

The first example below covers the case where your index is not a datetime type. You’ll need to first specify what you’re resampling on, which in this case is the timestamp column.

import pandas as pd
import numpy as np

# Timestamps are a regular column; the data has one row per minute
df = pd.DataFrame({'timestamp': pd.date_range('2020-03-30', periods=300, freq='T'),
                   'flow': np.random.randint(60, 1000, 300)})
df
#               timestamp  flow
# 0   2020-03-30 00:00:00   488
# 1   2020-03-30 00:01:00   996
# 2   2020-03-30 00:02:00   437
# 3   2020-03-30 00:03:00   599
# 4   2020-03-30 00:04:00   405
# ..                  ...   ...
# 295 2020-03-30 04:55:00   302
# 296 2020-03-30 04:56:00   425
# 297 2020-03-30 04:57:00   404
# 298 2020-03-30 04:58:00   987
# 299 2020-03-30 04:59:00   135
# 
# [300 rows x 2 columns]

# Returns data frame
df.resample("H", on='timestamp').sum() / 60
#                            flow
# timestamp                      
# 2020-03-30 00:00:00  523.350000
# 2020-03-30 01:00:00  548.033333
# 2020-03-30 02:00:00  516.466667
# 2020-03-30 03:00:00  425.533333
# 2020-03-30 04:00:00  490.416667

The next example covers the case where the timestamps are already the index.

# Index is time
df_idx = pd.DataFrame({'flow': np.random.randint(60, 1000, 300)},
                      index=pd.date_range('2020-03-30', periods=300, freq='T'))
df_idx
#                      flow
# 2020-03-30 00:00:00   532
# 2020-03-30 00:01:00   341
# 2020-03-30 00:02:00   964
# 2020-03-30 00:03:00   885
# 2020-03-30 00:04:00   186
# ...                   ...
# 2020-03-30 04:55:00   996
# 2020-03-30 04:56:00   946
# 2020-03-30 04:57:00   510
# 2020-03-30 04:58:00   564
# 2020-03-30 04:59:00   918
# 
# [300 rows x 1 columns]

# Returns a series
df_idx['flow'].resample('H').sum() / 60
# 2020-03-30 00:00:00    569.516667
# 2020-03-30 01:00:00    548.050000
# 2020-03-30 02:00:00    505.283333
# 2020-03-30 03:00:00    530.566667
# 2020-03-30 04:00:00    522.383333
# Freq: H, Name: flow, dtype: float64

The pandas documentation page on the .resample() method is quite useful as well: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html#pandas.DataFrame.resample
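One difference from a groupby is worth knowing: .resample() emits a row for every hour between the first and last timestamp, including hours with no data, while a groupby only returns the hours that actually occur. Here is a small sketch with made-up readings (the values and the missing hour are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data: readings in hour 00 and hour 02, nothing in hour 01.
ts = pd.to_datetime(['2020-03-30 00:10', '2020-03-30 00:40', '2020-03-30 02:15'])
df = pd.DataFrame({'timestamp': ts, 'flow': [60, 120, 180]})

# resample builds a regular hourly grid, so the empty hour appears with sum 0.
by_resample = df.resample('h', on='timestamp')['flow'].sum() / 60

# groupby on the floored timestamp only yields hours present in the data.
by_floor = df.groupby(df['timestamp'].dt.floor('h'))['flow'].sum() / 60

print(by_resample)  # three rows: 00:00, 01:00 (empty hour), 02:00
print(by_floor)     # two rows: 00:00 and 02:00
```

If the sensor has gaps, that decides which one you want: .resample() makes the gaps visible as zeros, and the groupby skips them.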

Answered By: Eric Leung