How to get min date and max date from pandas df based on another column
Question:
I have a pandas df as below:
Name | Date1 | Date2 |
---|---|---|
One | 199007 | 199010 |
One | 199206 | 199206 |
One | 199505 | 199505 |
Two | 19880701 | 19880701 |
Two | 19980704 | 19980704 |
Three | 2020 | 2020 |
Three | 2022 | 2022 |
Dates can be in any of the following formats:
(yyyy-mm)
(yyyymmdd)
(yyyymm)
(yyyy)
(yyyy-mm-dd)
The requirement: for each value in the Name column, find the minimum value of Date1 and the maximum value of Date2.
The expected output looks like this:
Name | Date1 | Date2 |
---|---|---|
One | 199007 | 199505 |
Two | 19880701 | 19980704 |
Three | 2020 | 2022 |
I have tried reading date1 and date2 columns as below:
df['date1'] = pd.to_datetime(df['date1'])
But it throws an error: month must be in 1..12: 199007
When I run this line of code:
df['date1'] = pd.to_datetime(df['date1'], format='%Y%m%d', errors='ignore')
it only suppresses the error, and nothing changes in the DataFrame.
What I am trying to do is first read the date1 and date2 columns as datetimes, and then find the minimum of date1 and the maximum of date2, combining rows that share the same Name.
Answers:
You can use .groupby and .agg to take the minimum of Date1 and the maximum of Date2 for each Name. Then restore the order of the Name column from a custom list, and finally convert the Date columns to datetime.
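import pandas as pd

# df is the DataFrame from the question.
# Aggregate the raw values per Name. This compares unparsed values,
# so it assumes each group uses a single date format.
df = df.groupby("Name").agg({"Date1": "min", "Date2": "max"}).reset_index()
# groupby sorts Name alphabetically, so restore the custom order.
sort_list = ["One", "Two", "Three"]
list_series = pd.Series(range(len(sort_list)), index=sort_list)
df = df.sort_values("Name", key=lambda x: x.map(list_series)).reset_index(drop=True)
# Try each format in turn; values that do not match become NaT
# and fall through to the next format.
date_columns = [c for c in df.columns if "Date" in c]
for column in date_columns:
    df[column] = (
        pd.to_datetime(df[column], format="%Y%m%d", errors="coerce")
        .fillna(pd.to_datetime(df[column], format="%Y%m", errors="coerce"))
        .fillna(pd.to_datetime(df[column], format="%Y", errors="coerce"))
    )
print(df)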
Prints:
    Name      Date1      Date2
0    One 1990-07-01 1995-05-01
1    Two 1988-07-01 1998-07-04
2  Three 2020-01-01 2022-01-01
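The question also lists hyphenated formats (yyyy-mm, yyyy-mm-dd) and asks to parse first and aggregate afterwards, with the expected output keeping the original values. A minimal sketch of that route might look like the following. The helper name `parse_mixed_dates`, the `PATTERNS` mapping, the helper columns `d1`/`d2`, and the assumption that the values can be treated as strings are illustrative choices, not part of the original answer. Each value is matched against a regular expression first, so a six-digit value is never tried against a day-level format.

```python
import pandas as pd

# Sample data from the question, stored as strings (an assumption;
# the question does not state the column dtype).
df = pd.DataFrame(
    {
        "Name": ["One", "One", "One", "Two", "Two", "Three", "Three"],
        "Date1": ["199007", "199206", "199505", "19880701", "19980704", "2020", "2022"],
        "Date2": ["199010", "199206", "199505", "19880701", "19980704", "2020", "2022"],
    }
)

# Hypothetical mapping: shape of the value -> strptime format to use.
PATTERNS = {
    r"\d{8}": "%Y%m%d",
    r"\d{4}-\d{2}-\d{2}": "%Y-%m-%d",
    r"\d{6}": "%Y%m",
    r"\d{4}-\d{2}": "%Y-%m",
    r"\d{4}": "%Y",
}


def parse_mixed_dates(values: pd.Series) -> pd.Series:
    """Parse values in any of the listed formats; anything else stays NaT."""
    text = values.astype(str).str.strip()
    parsed = pd.Series(pd.NaT, index=values.index, dtype="datetime64[ns]")
    for pattern, fmt in PATTERNS.items():
        mask = text.str.fullmatch(pattern)
        if mask.any():
            parsed[mask] = pd.to_datetime(text[mask], format=fmt, errors="coerce")
    return parsed


# Parse into helper columns so the original values stay untouched.
parsed = df.assign(d1=parse_mixed_dates(df["Date1"]), d2=parse_mixed_dates(df["Date2"]))

# sort=False keeps the groups in order of first appearance (One, Two, Three).
grouped = parsed.groupby("Name", sort=False)
first = grouped["d1"].idxmin()  # row label of the earliest Date1 per Name
last = grouped["d2"].idxmax()   # row label of the latest Date2 per Name

# Look up the original, unparsed values at those rows.
result = pd.DataFrame(
    {
        "Name": first.index.to_list(),
        "Date1": df.loc[first.to_numpy(), "Date1"].to_numpy(),
        "Date2": df.loc[last.to_numpy(), "Date2"].to_numpy(),
    }
)
print(result)
```

If parsed datetimes are wanted instead of the original values, `grouped.agg({"d1": "min", "d2": "max"})` on the same helper columns would give them. Note that this sketch assumes every group has at least one parseable value in each column.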