Python: create new columns based on ordering (dates) of variables
Question:
I have the following dataframe:
df =
patient_id  diagnosis_code  diagnosis_date  medication_name  medication_date
1           Diabetes        2014-08-05      A                2017-12-15
2           Diabetes        2019-06-07      A                2014-03-12
3           Diabetes        2015-06-18      B                2017-11-08
3           Heart Failure   2018-12-25      B                2017-11-08
4           Diabetes        2014-08-11      A                2017-07-07
and I would like to create new columns corresponding to the order in which diagnoses and medications took place:
df_output =
patient_id  State_1   State_2   State_3
1           Diabetes  A         NA
2           A         Diabetes  NA
3           Diabetes  B         Heart Failure
4           Diabetes  A         NA
Ideally, we would have a single row for each patient_id and as many States as we have observations for.
Answers:
Start by converting diagnosis_date and medication_date with pd.to_datetime, if they are not datetimes already. This makes the sort by date in a later step reliable. Concatenate the two column pairs (diagnosis code/date and medication name/date), renaming the columns of one pair so that concat stacks its rows under the same columns as the other. Drop any duplicates and sort by the single remaining date column, then group by patient_id. For each group, apply to_list() to the single column that now contains both the diagnosis codes and the medication names. Follow that with apply(pd.Series), so that each list item is placed in its own column. As the last step, rename the columns appropriately.
import pandas as pd
# whitespace-separated file, so multi-word values must not contain spaces (e.g. Heart_Failure)
df = pd.read_csv('sample.csv', sep=r'\s+')
print(df)
# if it is not already, convert to datetime
df['diagnosis_date'] = pd.to_datetime(df['diagnosis_date'])
df['medication_date'] = pd.to_datetime(df['medication_date'])
# stack medications and diagnoses under the same three column names
result = pd.concat(
    [df[['patient_id', 'medication_name', 'medication_date']],
     df[['patient_id', 'diagnosis_code', 'diagnosis_date']]
       .set_axis(['patient_id', 'medication_name', 'medication_date'], axis=1)]
).reset_index(drop=True)
result = result.drop_duplicates().sort_values('medication_date')
g = result.groupby('patient_id')
df_out = g.apply(lambda x: x['medication_name'].to_list()).apply(pd.Series)
# fix column names
df_out.columns = [f'State_{i+1}' for i in df_out.columns]
print(df_out)
Output from df_out:
            State_1   State_2   State_3
patient_id
1           Diabetes  A         NaN
2           A         Diabetes  NaN
3           Diabetes  B         Heart_Failure
4           Diabetes  A         NaN
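For anyone who wants to try this without a sample.csv, here is a minimal self-contained sketch of the same idea. It is a variation rather than the exact code above: the data is typed in from the question, the neutral column names event, date and state are my own choice for illustration, and it swaps to_list() plus apply(pd.Series) for groupby(...).cumcount() followed by pivot.

```python
import pandas as pd

# Sample data copied from the question (no CSV file needed)
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 3, 4],
    "diagnosis_code": ["Diabetes", "Diabetes", "Diabetes", "Heart Failure", "Diabetes"],
    "diagnosis_date": ["2014-08-05", "2019-06-07", "2015-06-18", "2018-12-25", "2014-08-11"],
    "medication_name": ["A", "A", "B", "B", "A"],
    "medication_date": ["2017-12-15", "2014-03-12", "2017-11-08", "2017-11-08", "2017-07-07"],
})

# Stack diagnoses and medications into one long table: (patient_id, event, date).
# "event" and "date" are illustrative names, not columns from the question.
diagnoses = df[["patient_id", "diagnosis_code", "diagnosis_date"]].rename(
    columns={"diagnosis_code": "event", "diagnosis_date": "date"}
)
medications = df[["patient_id", "medication_name", "medication_date"]].rename(
    columns={"medication_name": "event", "medication_date": "date"}
)
events = pd.concat([diagnoses, medications], ignore_index=True)
events["date"] = pd.to_datetime(events["date"])

# Drop repeated events, then order each patient's events chronologically
events = events.drop_duplicates().sort_values(["patient_id", "date"])

# Number the events within each patient: 1, 2, 3, ...
events["state"] = events.groupby("patient_id").cumcount() + 1

# One row per patient, one column per position in the sequence
df_out = events.pivot(index="patient_id", columns="state", values="event")
df_out.columns = [f"State_{i}" for i in df_out.columns]
print(df_out)
```

Patients with fewer events than the longest sequence get NaN in the trailing State columns, as in the output above. If two events for one patient share a date, their relative order here is simply whatever the sort leaves, so you may want to add a tie-breaking column to sort_values.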