Python script to find the number of date columns in a csv file and update the date format to MM-DD-YYYY

Question:

I get a file every day with around 15 columns. Some days there are two date columns and some days only one. The date format also varies: on some days it is YYYY-MM-DD and on others it is DD-MM-YYYY. The task is to convert the one or two date columns to MM-DD-YYYY. Sample data from the CSV file for a few columns:

Execution_date Extract_date Requestor_Name Count
2023-01-15 2023-01-15 John Smith 7

Sometimes we don't get the second column above (Extract_date):

Execution_date Requestor_Name Count
17-01-2023 Andrew Mill 3

Task is to find all the date columns in the file and change the date format to MM-DD-YYYY.

So the sample output for the above two files will be:

Execution_date Extract_date Requestor_Name Count
01-15-2023 01-15-2023 John Smith 7

Execution_date Requestor_Name Count
01-17-2023 Andrew Mill 3

I am using pandas and can’t figure out how to deal with the missing second column on some days and the change of the date value format.

I can hardcode the two column names and change the format with:

df['Execution_Date'] = pd.to_datetime(df['Execution_Date'], format='%d-%m-%Y').dt.strftime('%m-%d-%Y')
df['Extract_Date'] = pd.to_datetime(df['Extract_Date'], format='%d-%m-%Y').dt.strftime('%m-%d-%Y')

This will only work when the file has 2 columns and the values are in DD-MM-YYYY format.

I am looking for guidance on how to dynamically find the number of date columns and the date value format, so that I can use them in the two lines of code above. Failing that, any other solution would also work for me. I can use PowerShell if it can't be done in Python, but I am guessing Python offers far more ways to do this than PowerShell does.

Asked By: Arty155

Answers:

The following loads a CSV file into a dataframe, checks each value (that is a str) to see if it matches one of the date formats, and if it does rearranges the date to the format you’re looking for. Other values are untouched.

import pandas as pd
import re

df = pd.read_csv("today.csv")
# compile the patterns once, ahead of time, and reuse them for every cell
d_m_y = re.compile(r"(\d\d)-(\d\d)-(\d\d\d\d)")
d_m_y_replace = r"\2-\1-\3"  # DD-MM-YYYY -> MM-DD-YYYY
y_m_d = re.compile(r"(\d\d\d\d)-(\d\d)-(\d\d)")
y_m_d_replace = r"\2-\3-\1"  # YYYY-MM-DD -> MM-DD-YYYY

def change_dt(value):
    if isinstance(value, str):
        if d_m_y.fullmatch(value):
            return d_m_y.sub(d_m_y_replace, value)
        elif y_m_d.fullmatch(value):
            return y_m_d.sub(y_m_d_replace, value)
    return value

# on pandas 2.1+, applymap is deprecated; df.map(change_dt) does the same thing
new_df = df.applymap(change_dt)

However, if there are other columns containing dates that you don’t want to change, and you just want to specify the columns to be altered, use this instead of the last line above:

cols = ["Execution_date", "Extract_date"]

for col in cols:
    if col in df.columns:
        df[col] = df[col].apply(change_dt)

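The values are still strings after this. You can convert the columns to datetimes afterwards if you wish, for example with pd.to_datetime and format="%m-%d-%Y".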
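
Pattern matching only checks the shape of a string, so a value like 99-99-2023 would be rearranged as well. If you also want to count the date columns without knowing their names, a rough sketch using only the standard library might look like the following. The helper names (to_mm_dd_yyyy, find_date_columns, convert_rows) and the comma-separated sample are made up for illustration, and it assumes a column counts as a date column only when every non-empty value parses as a real calendar date in one of the two formats from the question.

```python
import csv
import io
from datetime import datetime

# The two input layouts named in the question; extend this tuple if more appear.
INPUT_FORMATS = ("%Y-%m-%d", "%d-%m-%Y")
OUTPUT_FORMAT = "%m-%d-%Y"


def to_mm_dd_yyyy(value):
    """Return value reformatted as MM-DD-YYYY, or None if it is not a valid date."""
    for fmt in INPUT_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime(OUTPUT_FORMAT)
        except ValueError:
            continue  # wrong layout or impossible date, try the next format
    return None


def find_date_columns(rows):
    """Names of columns whose every non-empty value parses as a date."""
    if not rows:
        return []
    date_cols = []
    for name in rows[0]:
        values = [row[name] for row in rows if row[name]]
        if values and all(to_mm_dd_yyyy(v) is not None for v in values):
            date_cols.append(name)
    return date_cols


def convert_rows(rows):
    """Rewrite the detected date columns in place and return their names."""
    date_cols = find_date_columns(rows)
    for row in rows:
        for name in date_cols:
            if row[name]:
                row[name] = to_mm_dd_yyyy(row[name])
    return date_cols


# Hypothetical comma-separated sample mirroring the question's second file.
sample = "Execution_date,Requestor_Name,Count\n17-01-2023,Andrew Mill,3\n"
rows = list(csv.DictReader(io.StringIO(sample)))
print(convert_rows(rows))         # ['Execution_date']
print(rows[0]["Execution_date"])  # 01-17-2023
```

len(convert_rows(rows)) then gives the number of date columns found in that day's file. The same to_mm_dd_yyyy helper could be passed to df[col].apply in place of change_dt, with the caveat that it returns None for non-dates instead of leaving the value untouched.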

Answered By: MattDMo

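You can use a function that selects every column whose name contains "date" (case-insensitive), parses it with one format, and uses .fillna to fall back to the other format. Chain further .fillna calls for any additional formats you need to support.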

import pandas as pd


def convert_to_datetime(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    for column in df.columns[df.columns.str.contains(column_name, case=False)]:
        df[column] = (
            pd.to_datetime(df[column], format="%d-%m-%Y", errors="coerce")
            .fillna(pd.to_datetime(df[column], format="%Y-%m-%d", errors="coerce"))
        ).dt.strftime("%m-%d-%Y")

    return df


data1 = {'Execution_date': '2023-01-15', 'Extract_date': '2023-01-15', 'Requestor_Name': "John Smith", 'Count': 7}
df1 = pd.DataFrame(data=[data1])

data2 = {'Execution_date': '17-01-2023', 'Requestor_Name': 'Andrew Mill', 'Count': 3}
df2 = pd.DataFrame(data=[data2])


final1 = convert_to_datetime(df=df1, column_name="date")
print(final1)
final2 = convert_to_datetime(df=df2, column_name="date")
print(final2)

Output:

  Execution_date Extract_date Requestor_Name  Count
0     01-15-2023   01-15-2023     John Smith      7

  Execution_date Requestor_Name  Count
0     01-17-2023    Andrew Mill      3
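
Matching on the column name assumes every date column has "date" somewhere in its header. If that cannot be relied on, one possible alternative is to look at the values instead: treat a column as a date column when all of its non-null values parse under one of the two formats. The sketch below is only an illustration of that idea; parse_dates, detect_date_columns and reformat_dates are hypothetical helper names, and FORMATS holds just the two layouts mentioned in the question.

```python
import pandas as pd

FORMATS = ("%d-%m-%Y", "%Y-%m-%d")  # the two layouts from the question


def parse_dates(series: pd.Series) -> pd.Series:
    """Parse with each format in turn; values matching none become NaT."""
    text = series.astype(str)
    parsed = pd.to_datetime(text, format=FORMATS[0], errors="coerce")
    for fmt in FORMATS[1:]:
        parsed = parsed.fillna(pd.to_datetime(text, format=fmt, errors="coerce"))
    return parsed


def detect_date_columns(df: pd.DataFrame) -> list:
    """Columns in which every non-null value parses as a date."""
    date_cols = []
    for col in df.columns:
        non_null = df[col].dropna()
        if not non_null.empty and parse_dates(non_null).notna().all():
            date_cols.append(col)
    return date_cols


def reformat_dates(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with every detected date column written as MM-DD-YYYY."""
    out = df.copy()
    for col in detect_date_columns(out):
        out[col] = parse_dates(out[col]).dt.strftime("%m-%d-%Y")
    return out


df = pd.DataFrame(
    {"Execution_date": ["17-01-2023"], "Requestor_Name": ["Andrew Mill"], "Count": [3]}
)
print(detect_date_columns(df))  # ['Execution_date']
print(reformat_dates(df))
```

Unlike the function above, this returns a new dataframe rather than modifying the one passed in, and len(detect_date_columns(df)) tells you how many date columns the file had that day.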

Answered By: Jason Baker