Pandas – By the same ID perform multiple conditions on the dataframe

Question:

I am having trouble applying multiple conditions across columns. I have never done this before and would appreciate some help. Given the following data, this is what is required:

      ID               user reception_date   end_date    Status
0  42872  [email protected]     2022-03-30 2022-03-30  Accepted
1  42872    [email protected]     2022-03-01 2022-03-04  Returned
2  42872  [email protected]     2022-03-07 2022-03-30  In Study
3   9999  [email protected]     2022-03-07 2022-03-30  Rejected

If the ID is the same, check whether the Status column contains "Accepted". Once this first requirement is met, check whether the "end_date" of the "Accepted" row is greater than or equal to the date of the "In Study" row. If that condition is true, change the status from "In Study" to "Accepted".

The expected output would be as follows:

      ID               user reception_date   end_date    Status
0  42872  [email protected]     2022-03-30 2022-03-30  Accepted
1  42872    [email protected]     2022-03-01 2022-03-04  Returned
2  42872  [email protected]     2022-03-07 2022-03-30  Accepted    
3   9999  [email protected]     2022-03-07 2022-03-30  Rejected

I have tried several ways of making the comparison, such as np.where, df.loc and apply(), but the results were not what I expected. I don't have much experience with Pandas and am still learning. Thank you very much!

Asked By: Andrés HK

Answers:

This is actually not so straightforward. You can use pd.merge_asof with direction='forward' to find, for each "In Study" row, the closest "Accepted" row with the same ID whose end_date is on or after that row's reception_date.
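To make the "closest forward value" idea concrete, here is a minimal sketch on toy data. The frames `left` and `right` and the column names `t`, `t_right` and `label` are hypothetical and only serve the illustration. Each left row is paired with the first right row of the same ID whose key is on or after its own key; a left row with no such partner gets missing values.

```python
import pandas as pd

# Hypothetical toy data, not taken from the question.
left = pd.DataFrame({
    "t": pd.to_datetime(["2022-03-07", "2022-03-31"]),
    "ID": [1, 1],
})
right = pd.DataFrame({
    "t_right": pd.to_datetime(["2022-03-30"]),
    "ID": [1],
    "label": ["Accepted"],
})

# Both frames must be sorted by their key column before merge_asof.
matched = pd.merge_asof(
    left, right,
    left_on="t", right_on="t_right",
    by="ID", direction="forward",
)
print(matched)
```

The first row (2022-03-07) finds the 2022-03-30 entry, while the second row (2022-03-31) has nothing on or after it, so its `label` is missing. This is why the result is filtered with dropna() before being written back.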

import pandas as pd

# ensure both date columns are datetime, so they can be compared and sorted
df[['reception_date', 'end_date']] = df[['reception_date', 'end_date']].apply(pd.to_datetime)

# boolean masks for selection of InStudy/Accepted
m1 = df['Status'].eq('In Study')
m2 = df['Status'].eq('Accepted')

# finding matching rows
s = pd.merge_asof(df.loc[m1, ['reception_date', 'ID']]
                    .reset_index() # save index, we'll need it later
                    .sort_values(by='reception_date'),
                  df.loc[m2, ['end_date', 'ID', 'Status']]
                    .sort_values(by='end_date'),
                  left_on='reception_date', right_on='end_date',
                  by='ID', direction='forward'
).set_index('index')['Status'].dropna()

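# updating values in place: s is a Series named 'Status' indexed by the
# original row labels, so only the matched "In Study" rows are overwritten
df.update(s)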

print(df)

Output:

      ID               user reception_date   end_date    Status
0  42872  [email protected]     2022-03-30 2022-03-30  Accepted
1  42872    [email protected]     2022-03-01 2022-03-04  Returned
2  42872  [email protected]     2022-03-07 2022-03-30  Accepted
3   9999  [email protected]     2022-03-07 2022-03-30  Rejected
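For comparison, here is a self-contained sketch of a different technique, not the merge_asof approach: take the latest "Accepted" end_date per ID and compare it with each "In Study" row's reception_date. It assumes, as the code above does, that the date to compare for "In Study" is reception_date. The `user` values are placeholders because the addresses are obscured in the post, and the names `latest_accept` and `promote` are illustrative.

```python
import pandas as pd

# Sample rows from the question; "user" values are placeholders.
df = pd.DataFrame({
    "ID": [42872, 42872, 42872, 9999],
    "user": ["user_a", "user_b", "user_c", "user_d"],
    "reception_date": ["2022-03-30", "2022-03-01", "2022-03-07", "2022-03-07"],
    "end_date": ["2022-03-30", "2022-03-04", "2022-03-30", "2022-03-30"],
    "Status": ["Accepted", "Returned", "In Study", "Rejected"],
})
date_cols = ["reception_date", "end_date"]
df[date_cols] = df[date_cols].apply(pd.to_datetime)

in_study = df["Status"].eq("In Study")
accepted = df["Status"].eq("Accepted")

# Latest end_date among "Accepted" rows, one value per ID.
latest_accept = df.loc[accepted].groupby("ID")["end_date"].max()

# Broadcast that date back to every row; IDs without an
# "Accepted" row get NaT, which compares as False below.
accept_date = df["ID"].map(latest_accept)

# Promote "In Study" rows that have an "Accepted" end_date on or after them.
promote = in_study & (accept_date >= df["reception_date"])
df.loc[promote, "Status"] = "Accepted"

print(df["Status"].tolist())
# ['Accepted', 'Returned', 'Accepted', 'Rejected']
```

On the sample data this matches the output above. It gives the same answer as the forward merge whenever the only question is whether some "Accepted" row of the same ID ends on or after the "In Study" row; merge_asof remains the better fit if other columns from the matched row are needed.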
Answered By: mozway

[Image: this answer was posted as a screenshot; its content is not available here.]

Answered By: G.G