Convert pandas DataFrame column of comma separated strings to one-hot encoded

Question:

I have a large dataframe (‘data’) made up of one column. Each row in the column is made of a string and each string is made up of comma separated categories. I wish to one hot encode this data.

For example,

data = {"mesh": ["A, B, C", "C,B", ""]}

From this I would like to get a dataframe consisting of:

index      A       B.     C
0          1       1      1
1          0       1      1
2          0       0      0

How can I do this?

Asked By: scutnex

||

Answers:

Note that you’re not dealing with OHEs.

str.split + stack + get_dummies + sum

df = pd.DataFrame(data)
df

      mesh
0  A, B, C
1      C,B
2         

(df.mesh.str.split('s*,s*', expand=True)
   .stack()
   .str.get_dummies()
   .sum(level=0))
df

   A  B  C
0  1  1  1
1  0  1  1
2  0  0  0

apply + value_counts

(df.mesh.str.split(r's*,s*', expand=True)
   .apply(pd.Series.value_counts, 1)
   .iloc[:, 1:]
   .fillna(0, downcast='infer'))

   A  B  C
0  1  1  1
1  0  1  1
2  0  0  0

pd.crosstab

x = df.mesh.str.split('s*,s*', expand=True).stack()
pd.crosstab(x.index.get_level_values(0), x.values).iloc[:, 1:]
df

col_0  A  B  C
row_0         
0      1  1  1
1      0  1  1
2      0  0  0
Answered By: cs95

Figured there is a simpler answer, or I felt this as more simple compared to multiple operations that we have to make.

  1. Make sure the column has unique values separated be commas

  2. Use get dummies in built parameter to specify the separator as comma. The default for this is pipe separated.

    data = {"mesh": ["A, B, C", "C,B", ""]}
    sof_df=pd.DataFrame(data)
    sof_df.mesh=sof_df.mesh.str.replace(' ','')
    sof_df.mesh.str.get_dummies(sep=',')
    

OUTPUT:

    A   B   C
0   1   1   1
1   0   1   1
2   0   0   0
Answered By: Tejeshwar Gurram

If categories are controlled (you know how many and who they are), best answer is by @Tejeshar Gurram. But, what if you have lots of potencial categories and you are not interested in all of them. Say:

s = pd.Series(['A,B,C,', 'B,C,D', np.nan, 'X,W,Z'])
 
0    A,B,C,
1     B,C,D
2       NaN
3     X,W,Z
dtype: object

If you are only interested in categories B and C for the final df of dummies, I’ve found this workaround does the job:

cat_list = ['B', 'C']
list_of_lists = [ (s.str.contains(cat_, regex=False)==True).astype(bool).astype(int).to_list()  for cat_ in cat_list]
data = {k:v for k,v in zip(cat_list,list_of_lists)}
pd.DataFrame(data)

   B  C
0  1  0
1  0  1
2  0  0
3  0  0
Answered By: Edu Marín
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.