Creating Consistent Time Format with Pandas

Question:

My overall goal is to pull the hour from each data point so I can list each beginning time. To do this, I know I need to clean my data so that it is all in a consistent format. I have been trying to use to_datetime and df[time].dt.hour to pull the data needed, but it does not work because the formatting is inconsistent.

This is the data I am working with:

Work Hours
08:15 AM-03:15PM
M,T,W,Th: 7:45AM-3:05PM F:7:45AM-2:07PM
7:45am-3:00pm
7:45AM.-2:15 PM

My current code:
df['Work Hours']_dt = pd.to_datetime(df)

I also tried:
df['Starting Time'] = df['Work Hours'].dt.hour

My primary concern is to clean the data first. Eventually I want to extract only the starting time for each workplace, so that it looks something like this:

Starting Time
8
7
9
7
Asked By: Steph


Answers:

This is a shot in the dark, and maybe someone can come up with a better answer, but you can use regex to substitute patterns, for example:

import re

# Match every letter and comma so they can be stripped out.
regex = r"[a-zA-Z,]"
test_str = "M,T,W,Th: 7:45AM-3:05PM F:7:45AM-2:07PM"
subst = ""

result = re.sub(regex, subst, test_str)

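which will turn that example string into ": 7:45-3:05 :7:45-2:07" (the leading colon and space are left over from "M,T,W,Th: ").

Then you can split on the : to extract the first hour. A word of caution, though: this returns the list ['', ' 7', '45-3', '05 ', '7', '45-2', '07'], so for this row the first hour is the second element rather than the first, and the AM/PM markers are gone. That is fine if you’re only looking for the first hour.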
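The strip-and-split idea above could be wrapped into a small helper along these lines. This is only a sketch: the name first_hour is made up for illustration, and because the letters are stripped the result is the hour as written on a 12-hour clock, with no AM/PM conversion.

```python
import re

def first_hour(text):
    """Return the first hour found in text as an int, or None if there is none."""
    # Strip letters and commas, as in the pattern above.
    cleaned = re.sub(r"[a-zA-Z,]", "", text)
    # Splitting on ":" can leave empty or padded pieces,
    # so take the first piece that is purely digits.
    for piece in cleaned.split(":"):
        piece = piece.strip()
        if piece.isdigit():
            return int(piece)
    return None

samples = [
    "08:15 AM-03:15PM",
    "M,T,W,Th: 7:45AM-3:05PM F:7:45AM-2:07PM",
    "7:45am-3:00pm",
    "7:45AM.-2:15 PM",
]
hours = [first_hour(s) for s in samples]
print(hours)  # [8, 7, 7, 7]
```

Skipping non-digit pieces is what handles the empty first element produced by rows that start with day labels.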

Have a play about with regex at https://regex101.com/ to find the pattern you’d like to match.

Answered By: Swinging Treebranch

You can extract the starting time from the string data in the ‘Work Hours’ column by using string manipulation techniques in Pandas. Here’s an example of how you could do that:

import re

import pandas as pd

df = pd.DataFrame({'Work Hours': ['08:15 AM-03:15PM', 'M,T,W,Th: 7:45AM-3:05PM F:7:45AM-2:07PM', '7:45am-3:00pm', '7:45AM.-2:15 PM']})

# First time in the string: hour, minutes, optional space, then AM/PM in any case.
TIME_PATTERN = re.compile(r'(\d{1,2}):\d{2}\s*([ap])\.?m', re.IGNORECASE)

def extract_start_time(work_hours):
    match = TIME_PATTERN.search(work_hours)
    if match is None:
        return None
    # Convert the 12-hour clock to 24-hour: 12 AM -> 0, 12 PM -> 12.
    hour = int(match.group(1)) % 12
    if match.group(2).lower() == 'p':
        hour += 12
    return hour

df['Starting Time'] = df['Work Hours'].apply(extract_start_time)

This will give you a new column ‘Starting Time’ with the extracted starting times as integers. If the starting time cannot be extracted, the function returns None.
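Since the question started from to_datetime and .dt.hour, here is a rough sketch of how that route might work once the strings are normalised first. It assumes every row contains at least one time written as H:MM followed by AM or PM (in any case, with an optional space), and that the locale uses the English AM/PM markers; the names parts and clean are just for illustration.

```python
import re

import pandas as pd

df = pd.DataFrame({'Work Hours': ['08:15 AM-03:15PM',
                                  'M,T,W,Th: 7:45AM-3:05PM F:7:45AM-2:07PM',
                                  '7:45am-3:00pm',
                                  '7:45AM.-2:15 PM']})

# Capture the first "H:MM" and the A or P that follows it.
parts = df['Work Hours'].str.extract(r'(\d{1,2}:\d{2})\s*([ap])\.?m',
                                     flags=re.IGNORECASE)

# Rebuild one uniform format, e.g. "7:45AM", that to_datetime can parse.
clean = parts[0] + parts[1].str.upper() + 'M'

# %I is the 12-hour clock hour and %p the AM/PM marker.
df['Starting Time'] = pd.to_datetime(clean, format='%I:%M%p').dt.hour
```

Rows with no recognisable time would come out as missing values rather than raising an error, because str.extract returns NaN when the pattern does not match.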

Answered By: Luis Gerardo Runge