Creating Consistent Time Format with Pandas
Question:
My overall goal is to pull the hour from each data point so that I can list each starting time. To do this, I know I need to clean my data so that it is all in a consistent format. I have been trying to use to_datetime and df[time].dt.hour to pull the data needed, but it does not work because the formatting is inconsistent.
This is the data I am working with:
Work Hours
08:15 AM-03:15PM
M,T,W,Th: 7:45AM-3:05PM F:7:45AM-2:07PM
7:45am-3:00pm
7:45AM.-2:15 PM
My current code:
df['Work Hours']_dt = pd.to_datetime(df)
I also tried:
df['Starting Time'] = df['Work Hours'].dt.hour
My primary concern is to clean the data first. Eventually I want to extract only the starting hour for each workplace, so that the result looks something like this:
Starting Time
8
7
9
7
Answers:
This is a shot in the dark, and maybe someone can come up with a better answer, but you can use regex to substitute patterns. For example:
import re

# Remove every letter and comma, leaving only digits and punctuation
regex = r"[a-zA-Z,]"
test_str = "M,T,W,Th: 7:45AM-3:05PM F:7:45AM-2:07PM"
subst = ""
result = re.sub(regex, subst, test_str)
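which turns the example string into ": 7:45-3:05 :7:45-2:07" (the leading colon is left over from "Th:").
Then you can split on the : to extract the first hour. A word of caution: this returns the list ['', ' 7', '45-3', '05 ', '7', '45-2', '07'], so the first hour is the second element and still has a leading space to strip. That is fine if you're only looking for the first hour.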
Have a play with regex at https://regex101.com/ to find the pattern that best matches your data.
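As a rough sketch of the same regex idea applied to the whole column rather than one test string, pandas' `Series.str.extract` can pull the first hour directly. The pattern below is an assumption based only on the four sample rows; it takes the digits before the first `:MM` and ignores AM/PM, so it returns the hour as written on a 12-hour clock.

```python
import pandas as pd

df = pd.DataFrame({
    "Work Hours": [
        "08:15 AM-03:15PM",
        "M,T,W,Th: 7:45AM-3:05PM F:7:45AM-2:07PM",
        "7:45am-3:00pm",
        "7:45AM.-2:15 PM",
    ]
})

# Assumed pattern: the hour digits of the first "H:MM" or "HH:MM" in the string.
pattern = r"(\d{1,2}):\d{2}"

# str.extract keeps only the first match per row; expand=False returns a Series.
first_hour = df["Work Hours"].str.extract(pattern, expand=False)

# Convert strings such as "08" and "7" to integers.
df["Starting Time"] = first_hour.astype(int)
print(df)
```

This is enough when every start time is in the morning, as in the sample. If some rows start in the afternoon, the AM/PM marker has to be captured as well.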
You can extract the starting time from the string data in the ‘Work Hours’ column by using string manipulation techniques in Pandas. Here’s an example of how you could do that:
import re
import pandas as pd

df = pd.DataFrame({'Work Hours': ['08:15 AM-03:15PM', 'M,T,W,Th: 7:45AM-3:05PM F:7:45AM-2:07PM', '7:45am-3:00pm', '7:45AM.-2:15 PM']})

# First time in the string: hour, optional minutes, then an AM/PM marker
# (any case, optional space before it, optional period inside it).
TIME_RE = re.compile(r'(\d{1,2})(?::\d{2})?\s*([AaPp])\.?[Mm]')

def extract_start_time(work_hours):
    match = TIME_RE.search(work_hours)
    if match is None:
        return None
    # Treat 12 as 0 so that 12 AM becomes 0 and 12 PM becomes 12
    hour = int(match.group(1)) % 12
    if match.group(2).upper() == 'P':
        hour += 12
    return hour

df['Starting Time'] = df['Work Hours'].apply(extract_start_time)
This gives you a new column 'Starting Time' with the extracted starting hours as integers. If a starting time cannot be extracted, the function returns None for that row.
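Since the question was originally about getting `to_datetime` and `.dt.hour` to work, here is a hedged sketch of that route: normalise the first time in each string to one consistent form, then parse it. The regex and the `"%I:%M %p"` format are assumptions based on the four sample rows, and `parts`, `clean` and `start` are illustrative names only.

```python
import pandas as pd

df = pd.DataFrame({
    "Work Hours": [
        "08:15 AM-03:15PM",
        "M,T,W,Th: 7:45AM-3:05PM F:7:45AM-2:07PM",
        "7:45am-3:00pm",
        "7:45AM.-2:15 PM",
    ]
})

# Capture the first "H:MM" plus the A or P of its AM/PM marker,
# tolerating optional whitespace, mixed case and a period inside the marker.
parts = df["Work Hours"].str.extract(r"(\d{1,2}:\d{2})\s*([AaPp])\.?[Mm]")

# Rebuild each start time in one consistent form, e.g. "7:45 AM".
clean = parts[0] + " " + parts[1].str.upper() + "M"

# Parse as a 12-hour clock time; the date part defaults to 1900-01-01.
start = pd.to_datetime(clean, format="%I:%M %p")

# .dt.hour now works because the column is a real datetime column.
df["Starting Time"] = start.dt.hour
print(df)
```

Rows where the pattern finds nothing come through as missing values instead of raising an error, which makes it easy to spot formats the pattern does not yet cover.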