Python script to find the number of date columns in a csv file and update the date format to MM-DD-YYYY
Question:
I get a file every day with around 15 columns. Some days there are 2 date columns and some days only one. The date format also varies: on some days it is YYYY-MM-DD and on others it's DD-MM-YYYY. The task is to convert the 1 or 2 date columns to MM-DD-YYYY. Sample data from the CSV file, showing a few columns:
Execution_date | Extract_date | Requestor_Name | Count |
---|---|---|---|
2023-01-15 | 2023-01-15 | John Smith | 7 |
Sometimes we don't get the second column above (Extract_date):
Execution_date | Requestor_Name | Count |
---|---|---|
17-01-2023 | Andrew Mill | 3 |
Task is to find all the date columns in the file and change the date format to MM-DD-YYYY.
So the sample output of above 2 files will be :
Execution_date | Extract_date | Requestor_Name | Count |
---|---|---|---|
01-15-2023 | 01-15-2023 | John Smith | 7 |

Execution_date | Requestor_Name | Count |
---|---|---|
01-17-2023 | Andrew Mill | 3 |
I am using pandas and can’t figure out how to deal with the missing second column on some days and the change of the date value format.
I can hardcode the 2 column names and change the format by :
df['Execution_date'] = pd.to_datetime(df['Execution_date'], format='%d-%m-%Y').dt.strftime('%m-%d-%Y')
df['Extract_date'] = pd.to_datetime(df['Extract_date'], format='%d-%m-%Y').dt.strftime('%m-%d-%Y')
This will only work when the file has 2 columns and the values are in DD-MM-YYYY format.
I'm looking for guidance on how to dynamically find the number of date columns and the date value format so that I can use them in the 2 lines of code above. Any other solution would also work for me. I can use PowerShell if it can't be done in Python, but I am guessing there will be a lot more avenues in Python for this than in PowerShell.
Answers:
The following loads a CSV file into a dataframe, checks each value that is a str to see whether it matches one of the two date formats, and, if it does, rearranges the date into the format you're looking for. Other values are left untouched.
import re

import pandas as pd

df = pd.read_csv("today.csv")

# Compile each pattern once so it can be reused for every cell.
d_m_y = re.compile(r"(\d\d)-(\d\d)-(\d\d\d\d)")  # DD-MM-YYYY
d_m_y_replace = r"\2-\1-\3"  # -> MM-DD-YYYY
y_m_d = re.compile(r"(\d\d\d\d)-(\d\d)-(\d\d)")  # YYYY-MM-DD
y_m_d_replace = r"\2-\3-\1"  # -> MM-DD-YYYY

def change_dt(value):
    # Only strings can be dates here; numbers and NaN pass through unchanged.
    if isinstance(value, str):
        if d_m_y.fullmatch(value):
            return d_m_y.sub(d_m_y_replace, value)
        elif y_m_d.fullmatch(value):
            return y_m_d.sub(y_m_d_replace, value)
    return value

# On pandas 2.1+, applymap is deprecated in favour of DataFrame.map.
new_df = df.applymap(change_dt)
However, if there are other columns containing dates that you don’t want to change, and you just want to specify the columns to be altered, use this instead of the last line above:
cols = ["Execution_date", "Extract_date"]
for col in cols:
    # Skip any column that is missing from today's file.
    if col in df.columns:
        df[col] = df[col].apply(change_dt)
You can convert the columns to datetimes if you wish.
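If the column names cannot be relied on, the date columns can also be found from their values. The following is only a rough sketch under some assumptions that are not in the question: within one file, every non-empty value of a date column uses the same layout, and only the two layouts from the question occur. The names `DATE_PATTERNS`, `detect_date_columns` and `normalise_dates` are made up for this illustration.

```python
import re

import pandas as pd

# Assumption: only the two layouts mentioned in the question can occur.
DATE_PATTERNS = {
    "%Y-%m-%d": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "%d-%m-%Y": re.compile(r"\d{2}-\d{2}-\d{4}"),
}


def detect_date_columns(df: pd.DataFrame) -> dict:
    """Map each date-like column to the strptime format its values follow."""
    found = {}
    for col in df.columns:
        values = df[col].dropna().astype(str)
        if values.empty:
            continue  # nothing to inspect in this column
        for fmt, pattern in DATE_PATTERNS.items():
            # A column counts as a date column only if every value matches.
            if values.map(lambda v: pattern.fullmatch(v) is not None).all():
                found[col] = fmt
                break
    return found


def normalise_dates(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with every detected date column rewritten as MM-DD-YYYY."""
    out = df.copy()
    for col, fmt in detect_date_columns(out).items():
        out[col] = pd.to_datetime(out[col], format=fmt).dt.strftime("%m-%d-%Y")
    return out


df = pd.DataFrame(
    {"Execution_date": ["17-01-2023"], "Requestor_Name": ["Andrew Mill"], "Count": [3]}
)
date_columns = detect_date_columns(df)
print(len(date_columns), date_columns)
print(normalise_dates(df))
```

`len(date_columns)` gives the number of date columns in the file, and the detected format is handed to `pd.to_datetime`, so an impossible date such as 99-99-2023 raises an error instead of being silently rearranged.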
You can use a function to check all column names that contain "date" and use .fillna to try other formats (add all possible formats).
import pandas as pd
def convert_to_datetime(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    # Pick every column whose name contains column_name, ignoring case.
    for column in df.columns[df.columns.str.contains(column_name, case=False)]:
        # Try DD-MM-YYYY first, then fill the values that failed with YYYY-MM-DD.
        df[column] = (
            pd.to_datetime(df[column], format="%d-%m-%Y", errors="coerce")
            .fillna(pd.to_datetime(df[column], format="%Y-%m-%d", errors="coerce"))
        ).dt.strftime("%m-%d-%Y")
    return df
data1 = {'Execution_date': '2023-01-15', 'Extract_date': '2023-01-15', 'Requestor_Name': "John Smith", 'Count': 7}
df1 = pd.DataFrame(data=[data1])
data2 = {'Execution_date': '17-01-2023', 'Requestor_Name': 'Andrew Mill', 'Count': 3}
df2 = pd.DataFrame(data=[data2])
final1 = convert_to_datetime(df=df1, column_name="date")
print(final1)
final2 = convert_to_datetime(df=df2, column_name="date")
print(final2)
Output:
  Execution_date Extract_date Requestor_Name  Count
0     01-15-2023   01-15-2023     John Smith      7
  Execution_date Requestor_Name  Count
0     01-17-2023    Andrew Mill      3
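One thing to keep in mind with errors="coerce": a value that matches none of the formats becomes NaT without any warning, and ends up as a missing value in the output. Below is a minimal sketch of how such values could be surfaced before the file is written back. The helper name `parse_with_report` and the `FORMATS` list are hypothetical, and the sample values are invented for the illustration.

```python
import pandas as pd

# Assumption: the two layouts from the question; extend the list as needed.
FORMATS = ["%d-%m-%Y", "%Y-%m-%d"]


def parse_with_report(series: pd.Series):
    """Try each format in turn; return (parsed dates, values that matched none)."""
    parsed = pd.to_datetime(series, format=FORMATS[0], errors="coerce")
    for fmt in FORMATS[1:]:
        parsed = parsed.fillna(pd.to_datetime(series, format=fmt, errors="coerce"))
    # Values that were present in the input but could not be parsed at all.
    failed = series[parsed.isna() & series.notna()]
    return parsed, failed


raw = pd.Series(["17-01-2023", "2023-01-15", "n/a", None])
parsed, failed = parse_with_report(raw)
print(parsed.dt.strftime("%m-%d-%Y").tolist())
print(failed.tolist())
```

If `failed` is not empty, the file probably contains a third format, which can then be added to `FORMATS`.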