Convert pandas DataFrame column of comma separated strings to one-hot encoded
Question:
I have a large dataframe (‘data’) made up of one column. Each row in the column is made of a string and each string is made up of comma separated categories. I wish to one hot encode this data.
For example,
data = {"mesh": ["A, B, C", "C,B", ""]}
From this I would like to get a dataframe consisting of:
index A B. C
0 1 1 1
1 0 1 1
2 0 0 0
How can I do this?
Answers:
Note that you’re not dealing with OHEs.
str.split
+ stack
+ get_dummies
+ sum
df = pd.DataFrame(data)
df
mesh
0 A, B, C
1 C,B
2
(df.mesh.str.split('s*,s*', expand=True)
.stack()
.str.get_dummies()
.sum(level=0))
df
A B C
0 1 1 1
1 0 1 1
2 0 0 0
apply
+ value_counts
(df.mesh.str.split(r's*,s*', expand=True)
.apply(pd.Series.value_counts, 1)
.iloc[:, 1:]
.fillna(0, downcast='infer'))
A B C
0 1 1 1
1 0 1 1
2 0 0 0
pd.crosstab
x = df.mesh.str.split('s*,s*', expand=True).stack()
pd.crosstab(x.index.get_level_values(0), x.values).iloc[:, 1:]
df
col_0 A B C
row_0
0 1 1 1
1 0 1 1
2 0 0 0
Figured there is a simpler answer, or I felt this as more simple compared to multiple operations that we have to make.
-
Make sure the column has unique values separated be commas
-
Use get dummies in built parameter to specify the separator as comma. The default for this is pipe separated.
data = {"mesh": ["A, B, C", "C,B", ""]}
sof_df=pd.DataFrame(data)
sof_df.mesh=sof_df.mesh.str.replace(' ','')
sof_df.mesh.str.get_dummies(sep=',')
OUTPUT:
A B C
0 1 1 1
1 0 1 1
2 0 0 0
If categories are controlled (you know how many and who they are), best answer is by @Tejeshar Gurram. But, what if you have lots of potencial categories and you are not interested in all of them. Say:
s = pd.Series(['A,B,C,', 'B,C,D', np.nan, 'X,W,Z'])
0 A,B,C,
1 B,C,D
2 NaN
3 X,W,Z
dtype: object
If you are only interested in categories B and C for the final df of dummies, I’ve found this workaround does the job:
cat_list = ['B', 'C']
list_of_lists = [ (s.str.contains(cat_, regex=False)==True).astype(bool).astype(int).to_list() for cat_ in cat_list]
data = {k:v for k,v in zip(cat_list,list_of_lists)}
pd.DataFrame(data)
B C
0 1 0
1 0 1
2 0 0
3 0 0
I have a large dataframe (‘data’) made up of one column. Each row in the column is made of a string and each string is made up of comma separated categories. I wish to one hot encode this data.
For example,
data = {"mesh": ["A, B, C", "C,B", ""]}
From this I would like to get a dataframe consisting of:
index A B. C
0 1 1 1
1 0 1 1
2 0 0 0
How can I do this?
Note that you’re not dealing with OHEs.
str.split
+ stack
+ get_dummies
+ sum
df = pd.DataFrame(data)
df
mesh
0 A, B, C
1 C,B
2
(df.mesh.str.split('s*,s*', expand=True)
.stack()
.str.get_dummies()
.sum(level=0))
df
A B C
0 1 1 1
1 0 1 1
2 0 0 0
apply
+ value_counts
(df.mesh.str.split(r's*,s*', expand=True)
.apply(pd.Series.value_counts, 1)
.iloc[:, 1:]
.fillna(0, downcast='infer'))
A B C
0 1 1 1
1 0 1 1
2 0 0 0
pd.crosstab
x = df.mesh.str.split('s*,s*', expand=True).stack()
pd.crosstab(x.index.get_level_values(0), x.values).iloc[:, 1:]
df
col_0 A B C
row_0
0 1 1 1
1 0 1 1
2 0 0 0
Figured there is a simpler answer, or I felt this as more simple compared to multiple operations that we have to make.
-
Make sure the column has unique values separated be commas
-
Use get dummies in built parameter to specify the separator as comma. The default for this is pipe separated.
data = {"mesh": ["A, B, C", "C,B", ""]} sof_df=pd.DataFrame(data) sof_df.mesh=sof_df.mesh.str.replace(' ','') sof_df.mesh.str.get_dummies(sep=',')
OUTPUT:
A B C
0 1 1 1
1 0 1 1
2 0 0 0
If categories are controlled (you know how many and who they are), best answer is by @Tejeshar Gurram. But, what if you have lots of potencial categories and you are not interested in all of them. Say:
s = pd.Series(['A,B,C,', 'B,C,D', np.nan, 'X,W,Z'])
0 A,B,C,
1 B,C,D
2 NaN
3 X,W,Z
dtype: object
If you are only interested in categories B and C for the final df of dummies, I’ve found this workaround does the job:
cat_list = ['B', 'C']
list_of_lists = [ (s.str.contains(cat_, regex=False)==True).astype(bool).astype(int).to_list() for cat_ in cat_list]
data = {k:v for k,v in zip(cat_list,list_of_lists)}
pd.DataFrame(data)
B C
0 1 0
1 0 1
2 0 0
3 0 0