Pandas – By the same ID perform multiple conditions on the dataframe
Question:
I have a challenge when applying multiple conditions in columns, never did it before and would be appreciated some help,from teh database it is required:
ID user reception_date end_date Status
0 42872 [email protected] 2022-03-30 2022-03-30 Accepted
1 42872 [email protected] 2022-03-01 2022-03-04 Returned
2 42872 [email protected] 2022-03-07 2022-03-30 In Study
3 9999 [email protected] 2022-03-07 2022-03-30 Rejected
if the ID is the same, check if in the Status column has the status of "Accepted", once verified this first requirement, check if the "end_date" of "Accepted" is greater or equal to the date of the status "In Study", if this condition is true change the status from "In Study" to "Accepted".
The expected output would be as follows:
ID user reception_date end_date Status
0 42872 [email protected] 2022-03-30 2022-03-30 Accepted
1 42872 [email protected] 2022-03-01 2022-03-04 Returned
2 42872 [email protected] 2022-03-07 2022-03-30 Accepted
3 9999 [email protected] 2022-03-07 2022-03-30 Rejected
I have tried several methods to make comparisons such as np.where
, df.loc
and tried using apply()
, however the results weren’t good as I expected, I don’t have much knowledge about Pandas and I am still learning, thank you very much!
Answers:
This is actually not so straightforward. You need to perform a merge_asof
to find the closest forward value per row of interest.
# ensure using datetime
df[['reception_date', 'end_date']] = df[['reception_date', 'end_date']].apply(pd.to_datetime)
# boolean masks for selection of InStudy/Accepted
m1 = df['Status'].eq('In Study')
m2 = df['Status'].eq('Accepted')
# finding matching rows
s = pd.merge_asof(df.loc[m1, ['reception_date', 'ID']]
.reset_index() # save index, we'll need it later
.sort_values(by='reception_date'),
df.loc[m2, ['end_date', 'ID', 'Status']]
.sort_values(by='end_date'),
left_on='reception_date', right_on='end_date',
by='ID', direction='forward'
).set_index('index')['Status'].dropna()
# updating values in place
df.update(s)
print(df)
Output:
ID user reception_date end_date Status
0 42872 [email protected] 2022-03-30 2022-03-30 Accepted
1 42872 [email protected] 2022-03-01 2022-03-04 Returned
2 42872 [email protected] 2022-03-07 2022-03-30 Accepted
3 9999 [email protected] 2022-03-07 2022-03-30 Rejected
I have a challenge when applying multiple conditions in columns, never did it before and would be appreciated some help,from teh database it is required:
ID user reception_date end_date Status
0 42872 [email protected] 2022-03-30 2022-03-30 Accepted
1 42872 [email protected] 2022-03-01 2022-03-04 Returned
2 42872 [email protected] 2022-03-07 2022-03-30 In Study
3 9999 [email protected] 2022-03-07 2022-03-30 Rejected
if the ID is the same, check if in the Status column has the status of "Accepted", once verified this first requirement, check if the "end_date" of "Accepted" is greater or equal to the date of the status "In Study", if this condition is true change the status from "In Study" to "Accepted".
The expected output would be as follows:
ID user reception_date end_date Status
0 42872 [email protected] 2022-03-30 2022-03-30 Accepted
1 42872 [email protected] 2022-03-01 2022-03-04 Returned
2 42872 [email protected] 2022-03-07 2022-03-30 Accepted
3 9999 [email protected] 2022-03-07 2022-03-30 Rejected
I have tried several methods to make comparisons such as np.where
, df.loc
and tried using apply()
, however the results weren’t good as I expected, I don’t have much knowledge about Pandas and I am still learning, thank you very much!
This is actually not so straightforward. You need to perform a merge_asof
to find the closest forward value per row of interest.
# ensure using datetime
df[['reception_date', 'end_date']] = df[['reception_date', 'end_date']].apply(pd.to_datetime)
# boolean masks for selection of InStudy/Accepted
m1 = df['Status'].eq('In Study')
m2 = df['Status'].eq('Accepted')
# finding matching rows
s = pd.merge_asof(df.loc[m1, ['reception_date', 'ID']]
.reset_index() # save index, we'll need it later
.sort_values(by='reception_date'),
df.loc[m2, ['end_date', 'ID', 'Status']]
.sort_values(by='end_date'),
left_on='reception_date', right_on='end_date',
by='ID', direction='forward'
).set_index('index')['Status'].dropna()
# updating values in place
df.update(s)
print(df)
Output:
ID user reception_date end_date Status
0 42872 [email protected] 2022-03-30 2022-03-30 Accepted
1 42872 [email protected] 2022-03-01 2022-03-04 Returned
2 42872 [email protected] 2022-03-07 2022-03-30 Accepted
3 9999 [email protected] 2022-03-07 2022-03-30 Rejected