Extracting values after a split to create a new column with a yes or no in Python

Question

sampleID	comorbidities
P01	hypertension, diabetes
P02	hypertension, diabetes
P03	diabetes
P04	CHD, asthma
P05	asthma, hypertension

Hello, I am new to coding and am doing some data cleaning in Python so that I can run a better analysis. A few of my columns hold several values in a single string. For example, the comorbidities column lists every comorbidity a patient has, separated by commas. I want to split these strings so that each comorbidity gets its own column with a simple yes/no or 1/0 for each patient. I am unable to post pictures, so I recreated the tables.

So far I have split the comorbidities column using:
df1 = pd.concat((df, df['comorbidities'].str.split(',', expand = True)), axis = 1, ignore_index = True)

The resulting dataframe looks like this:

0	1	2	3
P01	hypertension, diabetes	hypertension	diabetes
P02	hypertension, diabetes	hypertension	diabetes
P03	diabetes	diabetes	None
P04	CHD, asthma	CHD	asthma
P05	asthma, hypertension	asthma	hypertension

After this, I want to turn the split strings into new columns containing either yes/no or 1/0, so that each sample shows whether the patient has a given comorbidity. Any suggestions as to how to do this? I have tried groupby on just one column and on all the columns, and it does not work. I can't share the actual data, but I created a dummy dataset with an example and the output I want below.

sampleID	comorbidities	hypertension	diabetes	CHD	asthma
P01	hypertension, diabetes	yes	yes	no	no
P02	hypertension, diabetes	yes	yes	no	no
P03	diabetes	no	yes	no	no
P04	CHD, asthma	no	no	yes	yes
P05	asthma, hypertension	yes	no	no	yes

For example, I want to take hypertension and create a new column named hypertension that holds a simple yes/no or 1/0 for each sampleID. Any suggestions would be greatly appreciated!

Asked By: Dana Yang

Answer 1

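Use str.get_dummies to build one 0/1 indicator column per comorbidity, replace to map 0/1 to no/yes, and join to attach the result to the original DataFrame: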

out = df.join(df['comorbidities'].str.get_dummies(', ')
                                 .replace({0: 'no', 1: 'yes'}))

Output:

  sampleID           comorbidities  CHD asthma diabetes hypertension
0      P01  hypertension, diabetes   no     no      yes          yes
1      P02  hypertension, diabetes   no     no      yes          yes
2      P03                diabetes   no     no      yes           no
3      P04             CHD, asthma  yes    yes       no           no
4      P05    asthma, hypertension   no    yes       no          yes
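
For anyone who wants to run this end to end, here is a minimal sketch that assumes the dummy data from the question and keeps the 1/0 form instead of yes/no. The DataFrame construction and the column reordering are illustrative additions. The separator passed to str.get_dummies has to match the delimiter exactly, including the space after the comma.

```python
import pandas as pd

# Dummy data recreated from the question (not the asker's real data)
df = pd.DataFrame({
    "sampleID": ["P01", "P02", "P03", "P04", "P05"],
    "comorbidities": [
        "hypertension, diabetes",
        "hypertension, diabetes",
        "diabetes",
        "CHD, asthma",
        "asthma, hypertension",
    ],
})

# One 0/1 indicator column per distinct comorbidity.
# sep=", " includes the space so the labels carry no leading whitespace.
flags = df["comorbidities"].str.get_dummies(sep=", ")

# Optional: reorder the indicator columns to match the layout in the question
# (get_dummies returns them in sorted order otherwise).
flags = flags[["hypertension", "diabetes", "CHD", "asthma"]]

# Attach the indicators to the original DataFrame by index
out = df.join(flags)
print(out)
```

To get yes/no instead, apply .replace({0: 'no', 1: 'yes'}) to flags before the join, as in the snippet above.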

Answered By: mozway
