Convert pandas DataFrame column of comma separated strings to one-hot encoded

Question

I have a large dataframe (‘data’) made up of one column. Each row in the column is made of a string and each string is made up of comma separated categories. I wish to one hot encode this data.

For example,

data = {"mesh": ["A, B, C", "C,B", ""]}

From this I would like to get a dataframe consisting of:

index      A       B      C
0          1       1      1
1          0       1      1
2          0       0      0

How can I do this?

Asked By: scutnex

Answer 1

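Note that this is not strictly one-hot encoding: a one-hot row has exactly one 1, whereas here a row can have several 1s or none. What you want is a multi-label indicator (dummy) matrix.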

`str.split` + `stack` + `get_dummies` + `sum`

import pandas as pd

df = pd.DataFrame(data)
df

      mesh
0  A, B, C
1      C,B
2         

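# Split on commas with optional surrounding whitespace, one token per column
(df.mesh.str.split(r'\s*,\s*', expand=True)
   .stack()             # one token per row, indexed by (original row, position)
   .str.get_dummies()   # one indicator column per distinct non-empty token
   .groupby(level=0)    # sum(level=0) was removed in pandas 2.0
   .sum())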

   A  B  C
0  1  1  1
1  0  1  1
2  0  0  0

`apply` + `value_counts`

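(df.mesh.str.split(r'\s*,\s*', expand=True)
   .apply(pd.Series.value_counts, axis=1)   # count tokens per row
   .iloc[:, 1:]                             # drop the leading '' column from the empty row
   .fillna(0)
   .astype(int))                            # fillna(downcast=...) is deprecated in pandas 2.x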

   A  B  C
0  1  1  1
1  0  1  1
2  0  0  0

`pd.crosstab`

x = df.mesh.str.split(r'\s*,\s*', expand=True).stack()
# Cross-tabulate original row number against token; drop the leading '' column
pd.crosstab(x.index.get_level_values(0), x.values).iloc[:, 1:]

col_0  A  B  C
row_0         
0      1  1  1
1      0  1  1
2      0  0  0
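A further variant of the same split-then-count idea, as a minimal sketch rather than a definitive solution: it assumes pandas 0.25 or later (for `Series.explode`), and the names `tokens` and `result` are only illustrative. It avoids both the regex and the positional `iloc[:, 1:]` by stripping whitespace and filtering the empty token explicitly.

```python
import pandas as pd

df = pd.DataFrame({"mesh": ["A, B, C", "C,B", ""]})

# Split each string into a list of tokens, then give every token its own row.
# The original row number is kept as the index.
tokens = df["mesh"].str.split(",").explode().str.strip()

# Drop the empty token that comes from the empty string in row 2.
tokens = tokens[tokens != ""]

# One indicator column per token, summed back to one row per original index.
# reindex restores rows that had no tokens at all (here, row 2).
result = (
    pd.get_dummies(tokens, dtype=int)
    .groupby(level=0)
    .sum()
    .reindex(df.index, fill_value=0)
)
print(result)
```

The final `reindex` matters: without it, rows whose string is empty would disappear from the result instead of showing up as all zeros.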

Answered By: cs95

Answer 2

I think there is a simpler answer, one that avoids chaining several operations.

Make sure each row of the column holds unique values separated by commas.

Use the built-in `sep` parameter of `str.get_dummies` to set the separator to a comma. The default separator is the pipe character (`|`).

import pandas as pd

data = {"mesh": ["A, B, C", "C,B", ""]}
sof_df = pd.DataFrame(data)
# Remove all spaces so that "A, B" and "A,B" yield the same tokens
sof_df.mesh = sof_df.mesh.str.replace(' ', '')
sof_df.mesh.str.get_dummies(sep=',')

OUTPUT:

    A   B   C
0   1   1   1
1   0   1   1
2   0   0   0
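Removing every space also removes spaces inside a category name, so a label made of several words would be squashed together. A hedged sketch of a variant that only trims whitespace around the commas; the helper name `comma_dummies` and the multi-word sample data are hypothetical and not from the question:

```python
import pandas as pd


def comma_dummies(col: pd.Series) -> pd.DataFrame:
    """Indicator columns for a Series of comma-separated labels."""
    # Trim the ends, then collapse whitespace around each comma only,
    # leaving spaces inside a label untouched.
    cleaned = col.str.strip().str.replace(r"\s*,\s*", ",", regex=True)
    return cleaned.str.get_dummies(sep=",")


df = pd.DataFrame({"mesh": ["A, B, C", "C,B", ""]})
print(comma_dummies(df["mesh"]))

# Hypothetical multi-word labels, to show that inner spaces survive.
multi = pd.Series(["Heart Diseases , Asthma", "Asthma"])
print(comma_dummies(multi))
```

For the data in the question both approaches give the same table; the difference only shows up once labels contain spaces.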

Answered By: Tejeshwar Gurram

Answer 3

If the categories are controlled (you know how many there are and what they are), the best answer is the one by @Tejeshwar Gurram. But what if you have lots of potential categories and you are not interested in all of them? Say:

import numpy as np
import pandas as pd

s = pd.Series(['A,B,C,', 'B,C,D', np.nan, 'X,W,Z'])
s

0    A,B,C,
1     B,C,D
2       NaN
3     X,W,Z
dtype: object

If you are only interested in categories B and C for the final df of dummies, I’ve found this workaround does the job:

cat_list = ['B', 'C']
# str.contains returns NaN for missing rows; "== True" turns those into False
data = {cat: (s.str.contains(cat, regex=False) == True).astype(int)
        for cat in cat_list}
pd.DataFrame(data)

   B  C
0  1  1
1  1  1
2  0  0
3  0  0
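One caveat with `str.contains` is that it matches substrings, so a category `B` would also match a value such as `AB`. If exact matches are needed, a possible alternative (a sketch under the assumption that building all dummy columns first is affordable) is to let `str.get_dummies` tokenize the strings and then keep only the wanted columns with `reindex`:

```python
import numpy as np
import pandas as pd

s = pd.Series(["A,B,C,", "B,C,D", np.nan, "X,W,Z"])
cat_list = ["B", "C"]

# get_dummies splits on the separator and matches whole tokens only;
# missing values become rows of zeros and empty tokens are ignored.
all_dummies = s.str.get_dummies(sep=",")

# Keep only the categories of interest. fill_value=0 means a requested
# category that never occurs still gets a column of zeros.
dummies = all_dummies.reindex(columns=cat_list, fill_value=0)
print(dummies)
```

The trade-off is memory: every distinct category gets a temporary column before the selection, which the `str.contains` loop avoids.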

Answered By: Edu Marín
