Select previous row from the given parameters

Question:

I have a data frame that looks like this:

id code date
1 37 2022-01-11
1 22 2021-10-01
1 39 2019-02-11
1 21 2018-10-08
1 17 2018-09-19
1 18 2018-09-10
1 39 2017-03-20
1 36 2017-02-28
1 34 2017-02-14
1 31 2017-01-20
1 21 2016-11-17
1 17 2016-10-20

As you can see, the codes repeat after a large gap between dates. I want to obtain the row that comes immediately before each large time gap.

For instance:

id code date
1 22 2021-10-01
1 39 2019-02-11

Should give me the output:

id code date
1 39 2019-02-11

Since there is a large gap between 2019 and 2021, I want the last row that is available before that gap.

The output from the above data frame would be as follows:

id code date
1 37 2022-01-11
1 39 2019-02-11
1 39 2017-03-20

This is the code I tried, but it gives me only the first value for each time difference. I tried max, min, first and last, but they all give me the same result:

NOTE: sel is the data frame here. I used it in previous calculations, so I am using the same name here.

import pandas as pd
from datetime import timedelta

sel['date'] = pd.to_datetime(sel['date'], format='%Y%m%d') # convert date column to datetime

sel = sel.sort_values(by=['id', 'date']) # sort the dataframe by patient_id and date

sel['time_diff'] = sel.groupby('id')['date'].diff() # calculate time difference between consecutive rows for each patient

mask = (sel['time_diff'] >= timedelta(days=365)) | (sel['time_diff'].isnull()) # find rows where the time difference is greater than or equal to 1 year or null (first row for each id)

output_df = sel.loc[mask].groupby(['id', 'code']).agg({'date': 'max'}).reset_index() # select the rows where the mask is True and get the max date for each code for each patient

output_df = output_df[['id', 'code', 'date']] # select the desired columns
output_df

Any help is highly appreciated.
Thank you!

Asked By: Sukrut Shishupal

||

Answers:

Your code almost works. Here df is your sel data frame, with the date column already converted to datetime:

# note the ascending param: ids ascending, dates descending (newest first)
df = df.sort_values(by=['id', 'date'], ascending=[True, False])

# difference from the row above (the next-newer date) within each id;
# the values are negative because the dates are in descending order
time_diff = df.groupby('id')['date'].diff()

# note the different comparison: keep rows that are more than 365 days older
# than the row above, plus the first row of each id (where the diff is NaT)
mask = (time_diff < pd.Timedelta(days=-365)) | time_diff.isna()

# just apply the mask, no groupby/agg needed
df[mask]

Output:

   id  code       date
0   1    37 2022-01-11
2   1    39 2019-02-11
6   1    39 2017-03-20
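
To check this end to end, here is a minimal self-contained sketch that rebuilds the sample frame from the question and applies the same steps. The frame construction and the name result are only for illustration and are not part of the original code; the dates are assumed to be ISO-formatted strings, so pd.to_datetime is called without an explicit format.

```python
import pandas as pd

# Sample data copied from the question (ISO date strings are an assumption)
df = pd.DataFrame({
    "id": [1] * 12,
    "code": [37, 22, 39, 21, 17, 18, 39, 36, 34, 31, 21, 17],
    "date": ["2022-01-11", "2021-10-01", "2019-02-11", "2018-10-08",
             "2018-09-19", "2018-09-10", "2017-03-20", "2017-02-28",
             "2017-02-14", "2017-01-20", "2016-11-17", "2016-10-20"],
})
df["date"] = pd.to_datetime(df["date"])

# Newest row first within each id
df = df.sort_values(by=["id", "date"], ascending=[True, False])

# Gap to the row above (the next-newer date); negative in descending order
time_diff = df.groupby("id")["date"].diff()

# Keep rows more than 365 days older than the row above, plus each id's first row
mask = (time_diff < pd.Timedelta(days=-365)) | time_diff.isna()
result = df[mask]  # "result" is a hypothetical name used for this sketch
print(result)
```

The first row of each id is kept because its diff is NaT, which is why 2022-01-11 appears in the output even though no gap comes after it.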
Answered By: Quang Hoang