Python: create new columns based on ordering (dates) of variables

Question:

I have the following dataframe:

df =
patient_id  diagnosis_code  diagnosis_date  medication_name medication_date
1           Diabetes        2014-08-05      A               2017-12-15
2           Diabetes        2019-06-07      A               2014-03-12
3           Diabetes        2015-06-18      B               2017-11-08
3           Heart Failure   2018-12-25      B               2017-11-08
4           Diabetes        2014-08-11      A               2017-07-07

and I would like to create new columns corresponding to the order in which diagnoses and medications took place:

df_output =
patient_id  State_1    State_2  State_3
1           Diabetes   A        NA        
2           A          Diabetes NA              
3           Diabetes   B        Heart Failure                        
4           Diabetes   A        NA              

Ideally, we would have a single row for each patient_id and as many States as we have observations for.

Answers:

You can begin by converting diagnosis_date and medication_date with pd.to_datetime, if they are not already datetimes. This is needed for sorting by date in a later step. Next, concatenate the two column pairs (diagnosis code/date and medication name/date), renaming the diagnosis columns to match the medication columns so that concat stacks them as new rows rather than new columns. Drop any duplicate rows, sort by the single remaining date column, then group by patient_id. For each group, call to_list() on the single column that now holds both the diagnosis codes and the medication names. Follow that with apply(pd.Series), so that each list item is placed in its own column. As the last step, rename the columns appropriately.

import pandas as pd

# whitespace-separated file; the regex separator needs the backslash
df = pd.read_csv('sample.csv', sep=r'\s+')
print(df)

# if it is not already, convert to datetime
df['diagnosis_date'] = pd.to_datetime(df['diagnosis_date'])
df['medication_date'] = pd.to_datetime(df['medication_date'])

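# stack medications and diagnoses into one long table;
# the diagnosis columns are renamed so both parts share the same column names
result = pd.concat(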
    [df[['patient_id', 'medication_name', 'medication_date']],
     df[['patient_id', 'diagnosis_code', 'diagnosis_date']]
        .set_axis(['patient_id','medication_name', 'medication_date'], axis=1)]
    ).reset_index(drop=True)

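# drop repeated rows (e.g. patient 3's medication appears twice), then sort chronologically
result = result.drop_duplicates().sort_values('medication_date')
g = result.groupby('patient_id')

# collect each patient's events into a list, then expand the list into columns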

df_out = g.apply(lambda x: x['medication_name'].to_list()).apply(pd.Series)
# fix column names
df_out.columns = [f'State_{i+1}' for i in df_out.columns]
print(df_out)

Output from df_out

             State_1   State_2        State_3
patient_id
1           Diabetes         A            NaN
2                  A  Diabetes            NaN
3           Diabetes         B  Heart_Failure
4           Diabetes         A            NaN
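
The same idea can also be sketched without a CSV file and without apply(pd.Series). The snippet below is not part of the original answer: it builds the question's sample data inline, and it swaps in groupby().cumcount() plus pivot() as an alternative way to spread the ordered events into columns. The names events, event, date, state and wide are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question, built inline so no file is needed.
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 3, 4],
    "diagnosis_code": ["Diabetes", "Diabetes", "Diabetes", "Heart Failure", "Diabetes"],
    "diagnosis_date": ["2014-08-05", "2019-06-07", "2015-06-18", "2018-12-25", "2014-08-11"],
    "medication_name": ["A", "A", "B", "B", "A"],
    "medication_date": ["2017-12-15", "2014-03-12", "2017-11-08", "2017-11-08", "2017-07-07"],
})

# Give both kinds of event the same (illustrative) column names and stack them.
diagnoses = df[["patient_id", "diagnosis_code", "diagnosis_date"]].rename(
    columns={"diagnosis_code": "event", "diagnosis_date": "date"})
medications = df[["patient_id", "medication_name", "medication_date"]].rename(
    columns={"medication_name": "event", "medication_date": "date"})
events = pd.concat([diagnoses, medications], ignore_index=True)
events["date"] = pd.to_datetime(events["date"])

# Remove repeated rows, order chronologically within each patient,
# and number each patient's events 1, 2, 3, ...
events = events.drop_duplicates().sort_values(["patient_id", "date"])
events["state"] = events.groupby("patient_id").cumcount() + 1

# One row per patient, one column per event position.
wide = events.pivot(index="patient_id", columns="state", values="event")
wide.columns = [f"State_{i}" for i in wide.columns]
wide = wide.reset_index()
print(wide)
```

Because the event position is computed explicitly with cumcount(), the State_N column names follow directly from it, and patients with fewer events are padded with NaN by pivot(). If a diagnosis and a medication share the same date, their relative order is not defined by this sketch.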
Answered By: n1colas.m