Pandas groupby doesn't take UTC time into account

Question:

I have a data frame of several days, that looks something like:

df = 

                            col1
               date
2022-10-31 23:00:00    89.088556
2022-11-01 00:00:00    91.356805
2022-11-01 01:00:00    43.188002
2022-11-01 02:00:00    40.386937
2022-11-01 03:00:00    38.045470
...                          ...
2022-11-28 18:00:00   320.695662
2022-11-28 19:00:00   289.392580
2022-11-28 20:00:00   266.770852
2022-11-28 21:00:00   258.787157
2022-11-28 22:00:00   238.077054

So hourly interval in UTC.

Now, what I would like to do is just do a mean for each day. However, if I do something like df.groupby(df.index.day) the resulting df for one of the days looks something like this:

df_day1 =

                            col1     
               date
2022-11-01 00:00:00    91.356805
2022-11-01 01:00:00    43.188002
2022-11-01 02:00:00    40.386937
2022-11-01 03:00:00    38.045470
2022-11-01 04:00:00    38.063055
2022-11-01 05:00:00    44.420651
2022-11-01 06:00:00    99.115480
2022-11-01 07:00:00   105.455595
2022-11-01 08:00:00    97.385403
2022-11-01 09:00:00    88.105182
2022-11-01 10:00:00    82.650731
2022-11-01 11:00:00    79.717211
2022-11-01 12:00:00    78.173303
2022-11-01 13:00:00    72.926578
2022-11-01 14:00:00    77.644380
2022-11-01 15:00:00    95.185876
2022-11-01 16:00:00   110.620416
2022-11-01 17:00:00   124.516274
2022-11-01 18:00:00   117.191289
2022-11-01 19:00:00   102.931563
2022-11-01 20:00:00    96.657752
2022-11-01 21:00:00    93.358915
2022-11-01 22:00:00    92.043226
2022-11-02 23:00:00    56.089820

As such it seems okay. But since my date is in UTC, the correct day is actually from 23-22, and not 00-23. In my case this makes a difference.

I’ve tried to convert to localize to UTC, convert to CET and stuff like that. But it always shifts the hours used for the correct day, so that it is in fact not the right hours used for each day.

Am I doing something wrong ?

Asked By: Denver Dang

||

Answers:

You may use localization inside group by. This way your group will contain all the times for 1st day and 23:00 for the 31st day, same as in you first table reference.

df.groupby(df.index.tz_localize("UTC").tz_convert("Europe/Copenhagen").day).get_group(1)

Btw, a snippet to reproduce your situation:

import pandas as pd
import numpy as np
rng = pd.date_range("2022-10-31 23:00:00", periods=28*24, freq='H')
df = pd.DataFrame({'col1':np.random.randn(len(rng))}, index=rng)

UPD:
Another solution escaping time zones would be assembling groups based on a function:

group_counter = 0
def GroupFunc(ind):
    global group_counter
    if ind.hour == 23:
        group_counter += 1
    return group_counter
df.groupby(lambda x: GroupFunc(x))

Here .groupby() will iterate through the whole DataFrame and each time it finds 23:00 it will create a new group. This means each group will be 23 to 22.
⚠ in this case the original DataFrame should be sorted ascending.

Doing .groupby() with custom functions has a rare use, is considered a bit difficult and it’s difficult to understand and maintain without comments.

But! You don’t have to care about time zones anymore. And in my case the custom function approach worked 4 times faster.

Answered By: DiMithras
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.