Extracting values after a split to create a new column with a yes or no in Python

Question:

sampleID  comorbidities
P01       hypertension, diabetes
P02       hypertension, diabetes
P03       diabetes
P04       CHD, asthma
P05       asthma, hypertension

Hello, I am new to coding and am doing some data cleaning in Python. I want to break my data apart so that I can analyze it more easily. A few of my columns hold several comma-separated values in a single string. For example, one column lists each patient's comorbidities, and some patients have more than one. I want to split these strings so that each comorbidity gets its own column with a simple yes/no or 1/0 for each patient. I am unable to post pictures, so I recreated the tables.

Currently I have one column that has multiple strings contained within it. I split the column using:
df1 = pd.concat((df, df['comorbidities'].str.split(',', expand = True)), axis = 1, ignore_index = True)

The resulting dataframe looks like this:

0    1                       2             3
P01  hypertension, diabetes  hypertension  diabetes
P02  hypertension, diabetes  hypertension  diabetes
P03  diabetes                diabetes      None
P04  CHD, asthma             CHD           asthma
P05  asthma, hypertension    asthma        hypertension

After this, I want to turn the split strings into new columns containing either yes/no or 1/0, so that each sample shows whether the patient has a given comorbidity. Any suggestions on how to do this? I have tried groupby on one column and on all of the columns, and neither works. I can't share the actual data, but I created a dummy dataset; the output I want is below.

sampleID  comorbidities           hypertension  diabetes  CHD  asthma
P01       hypertension, diabetes  yes           yes       no   no
P02       hypertension, diabetes  yes           yes       no   no
P03       diabetes                no            yes       no   no
P04       CHD, asthma             no            no        yes  yes
P05       asthma, hypertension    yes           no        no   yes

For example, I want to take hypertension and create a new column named hypertension that holds a simple yes/no or 1/0 for each sampleID. Any suggestions would be greatly appreciated!

Asked By: Dana Yang


Answers:

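Use str.get_dummies combined with replace and join. str.get_dummies(', ') splits each string on ', ' and returns one 0/1 indicator column per distinct value, replace maps 0/1 to 'no'/'yes' (omit it to keep 1/0), and join attaches the result to the original DataFrame: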

out = df.join(df['comorbidities'].str.get_dummies(', ')
                                 .replace({0: 'no', 1: 'yes'}))

Output:

  sampleID           comorbidities  CHD asthma diabetes hypertension
0      P01  hypertension, diabetes   no     no      yes          yes
1      P02  hypertension, diabetes   no     no      yes          yes
2      P03                diabetes   no     no      yes           no
3      P04             CHD, asthma  yes    yes       no           no
4      P05    asthma, hypertension   no    yes       no          yes
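
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes the dummy data from the question. The names flags, out_binary, out_yes_no and flags_normalized are only illustrative, and the whitespace-normalizing step is an extra precaution that is not part of the one-liner above.

```python
import pandas as pd

# Dummy data recreated from the table in the question.
df = pd.DataFrame({
    "sampleID": ["P01", "P02", "P03", "P04", "P05"],
    "comorbidities": [
        "hypertension, diabetes",
        "hypertension, diabetes",
        "diabetes",
        "CHD, asthma",
        "asthma, hypertension",
    ],
})

# One 0/1 indicator column per distinct value.
# The separator must match the data exactly (comma followed by a space here).
flags = df["comorbidities"].str.get_dummies(sep=", ")

# 1/0 version: attach the indicator columns as they are.
out_binary = df.join(flags)

# yes/no version: map the integers to strings before joining.
out_yes_no = df.join(flags.replace({0: "no", 1: "yes"}))

# Optional precaution (an assumption about real data, not needed for the dummy
# set): if spacing around the commas is inconsistent, normalize it first and
# then split on a bare comma.
normalized = df["comorbidities"].str.replace(r"\s*,\s*", ",", regex=True)
flags_normalized = normalized.str.get_dummies(sep=",")

print(out_binary)
print(out_yes_no)
```

Keeping the 1/0 form is usually more convenient for later analysis, since the columns can be summed or averaged directly; the yes/no form is mainly for display.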
Answered By: mozway