How to convert datetimes while working with big data?

Question:

I’m working in Colab and trying to separate out a test set of the last two months of data, but I’m facing this error: ValueError: Both dates must have the same UTC offset. I know the error occurs because the start date of the slice is in BST and the end date is in GMT.

latest_df = df.loc['Sat 01 Oct 2022 12:00:03 AM BST':'Thu 01 Dec 2022 10:02:02 AM GMT']

latest_df.head()

I tried to convert the times manually in the Excel file of the dataset, but converting all the dates would take a long time because the dataset is so large.

Asked By: Wilson

Answers:

You can simply convert the timezone of your start and end dates instead of converting the whole dataset.

You can use the pytz library to convert the dates to the same timezone. Here’s an example:

import pandas as pd
import pytz

# Europe/London covers both BST (summer) and GMT (winter)
tz = pytz.timezone('Europe/London')

# Parse the boundary strings (without the zone suffix) and localize them
start_date = tz.localize(pd.Timestamp('2022-10-01 00:00:03'))
end_date = tz.localize(pd.Timestamp('2022-12-01 10:02:02'))

# Select the rows between the start and end dates
# (the index must be a tz-aware DatetimeIndex for this to work)
latest_df = df.loc[start_date:end_date]
latest_df.head()
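As a side note, here is a minimal self-contained sketch (with made-up data, not the asker's actual frame) showing why this avoids the error: the slice bounds are passed as tz-aware Timestamp objects rather than strings, against a tz-aware DatetimeIndex:

```python
import pandas as pd

# Hypothetical tz-aware index spanning both BST (October) and GMT (December)
idx = pd.to_datetime(
    ["2022-10-01 00:00:03", "2022-11-15 09:30:00", "2022-12-01 10:02:02"]
).tz_localize("Europe/London")
df = pd.DataFrame({"value": [1, 2, 3]}, index=idx)

# Tz-aware Timestamp bounds denote absolute instants, so the
# BST/GMT offset difference between them no longer matters
start = pd.Timestamp("2022-10-01 00:00:03", tz="Europe/London")
end = pd.Timestamp("2022-12-01 10:02:02", tz="Europe/London")

latest_df = df.loc[start:end]
print(len(latest_df))  # 3
```

The UTC-offset error is specific to slicing a tz-aware index with string bounds; actual Timestamp objects sidestep it entirely.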
Answered By: divyavinod6

Since I did not know the names of your columns, I assumed them to be A to F. You can replace them in the code with your column names:

import pandas as pd
import numpy as np
import pytz

# Create some sample data for testing
data = [
    'Sat 01 Oct 2022 12:00:03 AM BST',
    'Sat 01 Oct 2022 11:00:03 AM BST',
    'Sat 01 Oct 2022 10:00:03 AM BST',
    'Thu 01 Dec 2022 9:02:02 AM GMT',
    'Thu 01 Dec 2022 8:02:02 AM GMT',
    'Thu 01 Dec 2022 7:02:02 AM GMT'
]

df = pd.DataFrame(
    {
        "A": pd.Series(data),
        "B": pd.Series(np.random.randint(0,100,size=(6,))),
        "C": pd.Series(np.random.randint(0,100,size=(6,))),
        "D": pd.Series(np.random.randint(0,100,size=(6,))),
        "E": pd.Series(np.random.randint(0,100,size=(6,))),
        "F": pd.Series(np.random.randint(0,100,size=(6,)))
    })

# Create a new column of offsets by slicing the timezone abbreviation off each string
df["offset"] = df.A.apply(lambda x: x[-3:])

# Convert the remaining date text to a standard datetime
df["A"] = df.A.apply(lambda x: pd.to_datetime(x[:-4]))


>>> df

Output:

                     A   B   C   D   E   F offset
0  2022-10-01 00:00:03  60  39  66  49  31    BST
1  2022-10-01 11:00:03  25  87  42  74  39    BST
2  2022-10-01 10:00:03  82  95  36  45  30    BST
3  2022-12-01 09:02:02  27  21  44  58  74    GMT
4  2022-12-01 08:02:02  33  38  23  97  57    GMT
5  2022-12-01 07:02:02  53  42  32  67  95    GMT

I wrote a custom function to convert the timezones and applied it to the dataframe:

# Write a function to change the timezones to UTC/GMT
def convert_datetime_timezone(dt, tz1, tz2="UTC"):
    """
    dt: naive datetime (or pandas Timestamp)
    tz1: initial time zone abbreviation, e.g. "BST"
    tz2: target time zone, default="UTC"
    """
    if tz1 == "BST":
        tz1 = pytz.timezone("Europe/London")
        tz2 = pytz.timezone(tz2)

        dt = tz1.localize(dt)
        dt = dt.astimezone(tz2)
        dt = dt.strftime("%Y-%m-%d %H:%M:%S")
        converted_dt = pd.to_datetime(dt)
        return converted_dt
    else:
        return dt

# Apply the function and drop the offset column
df["A"] = df.apply(lambda x: convert_datetime_timezone(x["A"], x["offset"]), axis=1)
df.drop("offset", axis=1, inplace=True)

# Set your datetime as index so that you can use loc to target a date range
df.set_index("A", drop=True, inplace=True)
df.loc["2022-10-01 00:00:03":"2022-10-01 10:00:03",:]

Output:

                      B   C   D   E   F
A
2022-10-01 00:00:03  60  39  66  49  31
2022-10-01 10:00:03  82  95  36  45  30
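As an alternative sketch (not part of either answer above), the same conversion can be done without a per-row apply by using pandas' vectorized string and datetime accessors; the Series `s` here is a stand-in for the asker's datetime column:

```python
import pandas as pd

s = pd.Series([
    "Sat 01 Oct 2022 12:00:03 AM BST",
    "Thu 01 Dec 2022 09:02:02 AM GMT",
])

# Strip the trailing " BST"/" GMT", parse the wall-clock time,
# then localize to Europe/London and convert everything to UTC
naive = pd.to_datetime(s.str.slice(0, -4), format="%a %d %b %Y %I:%M:%S %p")
utc = naive.dt.tz_localize("Europe/London").dt.tz_convert("UTC")

print(utc.iloc[0])  # 2022-09-30 23:00:03+00:00
```

Because Europe/London observes BST in October and GMT in December, a single `tz_localize` handles both suffixes, and the resulting column is uniformly UTC, so `loc` range slices just work once it is sorted and set as the index.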
Answered By: ali bakhtiari