Select previous row from the given parameters
Question:
I have a data frame that looks like this:
id | code | date |
---|---|---|
1 | 37 | 2022-01-11 |
1 | 22 | 2021-10-01 |
1 | 39 | 2019-02-11 |
1 | 21 | 2018-10-08 |
1 | 17 | 2018-09-19 |
1 | 18 | 2018-09-10 |
1 | 39 | 2017-03-20 |
1 | 36 | 2017-02-28 |
1 | 34 | 2017-02-14 |
1 | 31 | 2017-01-20 |
1 | 21 | 2016-11-17 |
1 | 17 | 2016-10-20 |
As you can see, the codes repeat after certain gaps between the dates. I want to obtain the row that comes just before each large time gap.
For instance:
id | code | date |
---|---|---|
1 | 22 | 2021-10-01 |
1 | 39 | 2019-02-11 |
Should give me the output:
id | code | date |
---|---|---|
1 | 39 | 2019-02-11 |
Since there is a large gap between 2019 and 2021, I want the last row that is available before that gap.
The output from the above data frame would be as follows:
id | code | date |
---|---|---|
1 | 37 | 2022-01-11 |
1 | 39 | 2019-02-11 |
1 | 39 | 2017-03-20 |
This is the code I tried, but it gives me only the first value for each time difference. I tried max, min, first and last, but they all give me the same result:
NOTE: sel is the data frame here. I used it in previous calculations, so I am using the same name here.
import pandas as pd
from datetime import datetime, timedelta
sel['date'] = pd.to_datetime(sel['date'], format='%Y%m%d') # convert date column to datetime
sel = sel.sort_values(by=['id', 'date']) # sort the dataframe by patient_id and date
sel['time_diff'] = sel.groupby('id')['date'].diff() # calculate time difference between consecutive rows for each patient
mask = (sel['time_diff'] >= timedelta(days=365)) | (sel['time_diff'].isnull()) # find rows where the time difference is greater than or equal to 1 year or null (first row for each id)
output_df = sel.loc[mask].groupby(['id', 'code']).agg({'date': 'max'}).reset_index() # select the rows where the mask is True and get the max date for each code for each patient
output_df = output_df[['id', 'code', 'date']] # select the desired columns
output_df
Any help is highly appreciated.
Thank you!
Answers:
Your code almost works. Sort the dates in descending order within each id and flip the comparison, so that each row is compared with the more recent row above it (df below is your sel, with date already parsed as datetime):
# see the ascending param: id ascending, date descending
df = df.sort_values(by=['id', 'date'], ascending=[True, False])
# difference between each row and the more recent row above it, per id (negative, or NaT for the first row)
time_diff = df.groupby('id')['date'].diff()
# see the different comparison: keep rows more than 365 days older than the row above, plus the first (most recent) row of each id
mask = (time_diff < pd.Timedelta(days=-365)) | time_diff.isna()
# just mask here, no groupby/agg needed
df[mask]
Output:
id code date
0 1 37 2022-01-11
2 1 39 2019-02-11
6 1 39 2017-03-20
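For anyone who wants to reproduce this end to end, here is a minimal, self-contained sketch. It rebuilds the sample frame from the question and wraps the steps above in a helper. The function name `rows_before_gaps` and its `min_gap_days` parameter are made up for illustration, and the `%Y-%m-%d` format is an assumption based on how the dates are displayed in the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1] * 12,
    "code": [37, 22, 39, 21, 17, 18, 39, 36, 34, 31, 21, 17],
    "date": [
        "2022-01-11", "2021-10-01", "2019-02-11", "2018-10-08",
        "2018-09-19", "2018-09-10", "2017-03-20", "2017-02-28",
        "2017-02-14", "2017-01-20", "2016-11-17", "2016-10-20",
    ],
})
# Assumption: dates are ISO-formatted strings, as shown in the question.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")


def rows_before_gaps(frame, min_gap_days=365):
    """Hypothetical helper: per id, keep the most recent row plus every row
    that is the last one before a gap of more than `min_gap_days` days."""
    # Newest first within each id.
    ordered = frame.sort_values(by=["id", "date"], ascending=[True, False])
    # Each row minus the more recent row above it: negative values,
    # NaT for the first (most recent) row of each id.
    time_diff = ordered.groupby("id")["date"].diff()
    mask = (time_diff < pd.Timedelta(days=-min_gap_days)) | time_diff.isna()
    return ordered[mask]


result = rows_before_gaps(df)
print(result)
```

Changing `min_gap_days` adjusts what counts as a large gap. The comparison is strict, so a gap of exactly 365 days is not selected.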