Selecting dates in Pandas DataFrame to calculate daylight savings time

Question:

I’m trying to select a range of dates in a Pandas DataFrame (containing half hourly data) to determine the daylight savings time of those days. The start of DST is the last Sunday of September, and it ends on the first Sunday of April.

import numpy as np
import pandas as pd
from datetime import datetime, date, timedelta

...

df0 = df0.set_index('datetime')

df0['mnth'] = pd.DatetimeIndex(df0.index).month
df0['dow'] = pd.DatetimeIndex(df0.index).dayofweek # Mon=0, ..., Sun=6

start_dst = df0.iloc[(df0.mnth==9) & (df0.dow==6).idxmax()]
end_dst = df0.iloc[(df0.mnth==4) & (df0.dow==6).idxmin()]
df0.index[start_dst:end_dst] = df0.index + pd.Timedelta('1h')

My data is essentially shifted 1 hour backwards in the Sep-Apr period, so I need to add 1h to the timestamps in this period. But when I define start_dst, I get an error

TypeError: Cannot perform 'and_' with a dtyped [bool] array and scalar of type [bool]

I’m not sure how to change start_dst.

Edit: Here is a sample dataframe:

# End DST: first Sunday of April, 1h backward (5 Apr 2020)
# Start DST: last Sunday of September, 1h forward (27 Sep 2020)
# 4,5,6 April 2020, 26,27,28 Sep 2020
d1 = '2020-04-04'
d2 = '2020-04-05'
d3 = '2020-04-06'
d4 = '2020-09-26'
d5 = '2020-09-27'
d6 = '2020-09-28'

df1 = pd.DataFrame()
df1['date'] = pd.to_datetime([d1]*24, format='%Y-%m-%d')
df1['time'] = (pd.date_range(d1, periods=24, freq='H') - pd.Timedelta(hours=1)).time
df1 = df1.set_index('date')

df2 = pd.DataFrame()
df2['date'] = pd.to_datetime([d2]*25, format='%Y-%m-%d')
df2['time'] = (pd.date_range(d2, periods=25, freq='H') - pd.Timedelta(hours=1)).time
df2 = df2.set_index('date')

df3 = pd.DataFrame()
df3['date'] = pd.to_datetime([d3]*24, format='%Y-%m-%d')
df3['time'] = (pd.date_range(d3, periods=24, freq='H')).time
df3 = df3.set_index('date')

df4 = pd.DataFrame()
df4['date'] = pd.to_datetime([d4]*24, format='%Y-%m-%d')
df4['time'] = (pd.date_range(d4, periods=24, freq='H')).time
df4 = df4.set_index('date')

df5 = pd.DataFrame()
df5['date'] = pd.to_datetime([d5]*23, format='%Y-%m-%d')
df5a = pd.DataFrame(pd.date_range('00:00', '01:59', freq='H').time)
df5b = pd.DataFrame(pd.date_range('01:00', '01:59', freq='H').time)
df5c = pd.DataFrame(pd.date_range('03:00', '22:00', freq='H').time)
df5['time'] = pd.concat([df5a,df5b,df5c],axis=0).values
df5 = df5.set_index('date')

df6 = pd.DataFrame()
df6['date'] = pd.to_datetime([d6]*24, format='%Y-%m-%d')
df6['time'] = (pd.date_range(d6, periods=24, freq='H') - pd.Timedelta(hours=1)).time
df6 = df6.set_index('date')

df0 = pd.DataFrame()
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames the same way
z = pd.concat([df1, df2, df3, df4, df5, df6])
df0['datetime'] = pd.to_datetime(z.index.astype(str)+' '+z.time.astype(str),
                            format='%Y-%m-%d %H:%M:%S')
df0 = df0.set_index('datetime')

df0['mnth'] = pd.DatetimeIndex(df0.index).month
df0['dow'] = pd.DatetimeIndex(df0.index).dayofweek # Mon=0, ..., Sun=6
df0['hour'] = pd.DatetimeIndex(df0.index).hour
Asked By: Medulla Oblongata


Answers:

I believe the error comes from idxmax() and idxmin(). Both return a single index label, not a boolean Series, whereas (df0.mnth==9) and (df0.mnth==4) each return a Series of True and False values. Combining that boolean Series with the single label using & raises this error.
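To make the difference concrete, here is a minimal sketch on a hypothetical four-row frame (the name demo and its dates are made up for illustration, not taken from the question's data). It shows that idxmax() yields one label, while combining two boolean Series gives a row-by-row mask that .loc accepts.

```python
import pandas as pd

# Hypothetical frame: Sat 26 Sep 2020 and Sun 27 Sep 2020, two rows each
idx = pd.to_datetime(["2020-09-26 00:00", "2020-09-26 12:00",
                      "2020-09-27 00:00", "2020-09-27 12:00"])
demo = pd.DataFrame(index=idx)
demo["mnth"] = demo.index.month
demo["dow"] = demo.index.dayofweek   # Mon=0, ..., Sun=6

is_sunday = demo["dow"] == 6         # boolean Series, one value per row
first_sunday = is_sunday.idxmax()    # a single index label (the first True)

# Two boolean Series of equal length combine element-wise into a mask
mask = (demo["mnth"] == 9) & is_sunday
selected = demo.loc[mask]            # only the Sunday rows in September
```

Note that boolean masks belong with .loc, not .iloc, when the index is a DatetimeIndex.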

You can create/define a function that will give you the index by calculating the condition:

def get_indexex():
    try:
        idxmx=df0.index==((df0['dow']==6).idxmax())
        idxmn=df0.index==((df0['dow']==6).idxmin())
        start_dst = df0.loc[(df0['mnth']==9) & idxmx]
        end_dst = df0.loc[(df0['mnth']==4) & idxmn]
        if not start_dst.index.tolist():
            return df0.loc[:end_dst.index[-1]].index
        elif not end_dst.index.tolist():
            return  df0.loc[start_dst.index[0]:].index
        else:
            return  df0.loc[start_dst.index[0]:end_dst.index[-1]].index
    except IndexError:
        start_dst=df0.loc[(df0['dow'].eq(6) & df0['mnth'].eq(9)) & df0['hour'].eq(2)]
        end_dst=df0.loc[df0['mnth'].eq(4) & df0['hour'].eq(3)]
        if not start_dst.index.tolist():
            return df0.loc[:end_dst.index[-1]].index
        elif not end_dst.index.tolist():
            return  df0.loc[start_dst.index[0]:].index
        else:
            return  df0.loc[start_dst.index[0]:end_dst.index[-1]].index

Finally:

df0['dt']=df0.index
m=df0.index.isin(get_indexex())
df0.loc[m,'dt']=df0.loc[m,'dt']+pd.Timedelta('1h')
df0.index=df0.pop('dt')

Some explanations:

  • You can't modify a subset of the index in place, so we create a 'dt' column, set it equal to the index of the dataframe, and change that column instead.

  • The variables idxmx and idxmn compare the values returned by idxmax() and idxmin() with the index of the dataframe, which gives a boolean array. You were getting the error because (df0.dow==6).idxmax() or (df0.dow==6).idxmin() gives a single value, not a Series of boolean values.

  • The function get_indexex() returns the index labels where the condition is satisfied, and it handles the situation where start_dst or end_dst is an empty dataframe.

  • Also note that inside the function we slice from the first index of start_dst to the last index of end_dst, to cover the cases where start_dst and end_dst contain multiple entries.
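The first point can be illustrated in isolation. The sketch below uses a hypothetical half-hourly frame called demo and a made-up mask m (both are assumptions for illustration). It shows that assigning to part of the index fails, and that going through a helper column works.

```python
import pandas as pd

# Hypothetical half-hourly data on 27 Sep 2020
idx = pd.date_range("2020-09-27 00:00", periods=4, freq="30min")
demo = pd.DataFrame({"value": range(4)}, index=idx)

# Assigning to a slice of the index is not allowed: Index objects are immutable
try:
    demo.index[2:] = demo.index[2:] + pd.Timedelta(hours=1)
except TypeError as exc:
    error_message = str(exc)

# Made-up mask: shift only rows at or after 01:00
m = demo.index >= pd.Timestamp("2020-09-27 01:00")

# Work on a helper column, then swap it in as the new index
demo["dt"] = demo.index
demo.loc[m, "dt"] = demo.loc[m, "dt"] + pd.Timedelta(hours=1)
demo.index = demo.pop("dt")
```

After this, the last two rows are labelled 02:00 and 02:30, while the first two keep their original timestamps.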

Update:

You are getting 2020-04-05 23:00:00 from the function because the condition in the try branch is satisfied, so one of end_dst and start_dst produces that result. If you don't want this, you can remove that case from the function, which then becomes:

def get_indexex():
    start_dst=df0.loc[(df0['dow'].eq(6) & df0['mnth'].eq(9)) & df0['hour'].eq(2)]
    end_dst=df0.loc[df0['mnth'].eq(4) & df0['hour'].eq(3)]
    if not start_dst.index.tolist():
        return df0.loc[:end_dst.index[-1]].index
    elif not end_dst.index.tolist():
        return  df0.loc[start_dst.index[0]:].index
    else:
        return  df0.loc[start_dst.index[0]:end_dst.index[-1]].index

Finally:

df0['dt']=df0.index
m=df0.index.isin(get_indexex())
df0.loc[m,'dt']=df0.loc[m,'dt']+pd.Timedelta('1h')
df0.index=df0.pop('dt')
Answered By: Anurag Dabas

The thought of dealing with DST manually gives me a headache. Pandas Timestamp objects (the single values of a datetime Series) have a dst() method, which returns the daylight saving time offset when the timestamp is timezone-aware.
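A minimal sketch of how dst() behaves, assuming the data follows the "Pacific/Auckland" timezone. That zone is a guess: its rules match the last-Sunday-of-September and first-Sunday-of-April dates in the question, but the question does not name a timezone.

```python
import pandas as pd
from datetime import timedelta

tz = "Pacific/Auckland"  # assumption: not stated in the question

summer = pd.Timestamp("2020-01-15 12:00", tz=tz)  # inside the Sep-Apr DST period
winter = pd.Timestamp("2020-07-15 12:00", tz=tz)  # outside the DST period

summer_offset = summer.dst()   # one hour while DST is in effect
winter_offset = winter.dst()   # zero when DST is not in effect

# A timezone-naive timestamp carries no DST information, so dst() returns None
naive_offset = pd.Timestamp("2020-01-15 12:00").dst()
```

With timezone-aware timestamps the offset comes from the timezone database, so the start and end dates do not have to be located by hand.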

Answered By: Christian Pao.