How to get min date and max date from pandas df based on another column

Question:

I have a pandas df as below:

Name Date1 Date2
One 199007 199010
One 199206 199206
One 199505 199505
Two 19880701 19880701
Two 19980704 19980704
Three 2020 2020
Three 2022 2022

Dates could be in any of the following formats:
(yyyy-mm)
(yyyymmdd)
(yyyymm)
(yyyy)
(yyyy-mm-dd)

The requirement: for each value in the Name column, find the minimum value of the Date1 column and the maximum value of the Date2 column.
The expected data looks like this:

Name Date1 Date2
One 199007 199505
Two 19880701 19980704
Three 2020 2022

I have tried reading date1 and date2 columns as below:
df['date1'] = pd.to_datetime(df['date1'])

But it throws an error: month must be in 1..12: 199007

When I run this line of code:
df['date1'] = pd.to_datetime(df['date1'], format='%Y%m%d', errors='ignore')

it only ignores the error, and nothing changes in the DataFrame.

What I am trying to do is first read the date1 and date2 columns as datetimes, and then find the min and max values of date1 and date2 for each Name, collapsing the duplicate names.

Asked By: Rag

Answers:

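You can use .groupby and .agg to take the minimum of Date1 and the maximum of Date2 for each Name. Then sort the Name column using a custom list, because groupby sorts the group keys alphabetically by default. Finally, convert the Date columns to datetime by trying each format in turn.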

import pandas as pd

# df is the DataFrame from the question.
# Per Name: smallest Date1 and largest Date2, taken on the raw values.
df = df.groupby("Name").agg({"Date1": "min", "Date2": "max"}).reset_index()

# groupby sorts Name alphabetically, so restore the custom order.
sort_list = ["One", "Two", "Three"]
list_series = pd.Series(range(len(sort_list)), index=sort_list)
df = df.sort_values("Name", key=lambda x: x.map(list_series)).reset_index(drop=True)

# Try each format in turn: values that fail one format become NaT
# and are filled in by the next format.
date_columns = [x for x in df.columns if "Date" in x]
for column in date_columns:
    df[column] = (
        pd.to_datetime(df[column], format="%Y%m%d", errors="coerce")
        .fillna(pd.to_datetime(df[column], format="%Y%m", errors="coerce"))
        .fillna(pd.to_datetime(df[column], format="%Y", errors="coerce"))
    )

print(df)

    Name      Date1      Date2
0    One 1990-07-01 1995-05-01
1    Two 1988-07-01 1998-07-04
2  Three 2020-01-01 2022-01-01
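
A possible variation, sketched under assumptions rather than taken from the answer above: parse the dates first and aggregate afterwards, so that the minimum and maximum are compared as dates rather than as raw values, and cover the hyphenated formats from the question as well. The helper parse_mixed and the PATTERNS table are hypothetical names, and the columns are assumed to hold strings. The format is picked from the shape of each value, so a six-digit year-month is never offered to a day-level format. Passing sort=False to groupby keeps the names in order of first appearance, which stands in for the custom sort list.

```python
import pandas as pd

# Sample data from the question, kept as strings (an assumption) so that
# hyphenated values such as "1990-07" could be mixed in as well.
df = pd.DataFrame(
    {
        "Name": ["One", "One", "One", "Two", "Two", "Three", "Three"],
        "Date1": ["199007", "199206", "199505", "19880701", "19980704", "2020", "2022"],
        "Date2": ["199010", "199206", "199505", "19880701", "19980704", "2020", "2022"],
    }
)

# Shape of a value -> strptime format, covering the five formats listed
# in the question. Each pattern must match the whole value.
PATTERNS = {
    r"\d{8}": "%Y%m%d",
    r"\d{4}-\d{2}-\d{2}": "%Y-%m-%d",
    r"\d{6}": "%Y%m",
    r"\d{4}-\d{2}": "%Y-%m",
    r"\d{4}": "%Y",
}


def parse_mixed(column: pd.Series) -> pd.Series:
    """Hypothetical helper: parse each value with the format its shape implies."""
    text = column.astype(str).str.strip()
    parsed = pd.Series(pd.NaT, index=column.index, dtype="datetime64[ns]")
    for pattern, fmt in PATTERNS.items():
        mask = text.str.fullmatch(pattern)
        if mask.any():
            # Values that match the shape but are not valid dates stay NaT.
            parsed.loc[mask] = pd.to_datetime(text[mask], format=fmt, errors="coerce")
    return parsed


for column in ["Date1", "Date2"]:
    df[column] = parse_mixed(df[column])

# sort=False keeps the groups in order of first appearance.
result = (
    df.groupby("Name", sort=False)
    .agg({"Date1": "min", "Date2": "max"})
    .reset_index()
)
print(result)
```

Values that match none of the listed shapes are left as NaT, so they can be inspected with result.isna() or df.isna() instead of raising an error.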
   
Answered By: Jason Baker