How to label-encode comma-separated text in a DataFrame column in Python?

Question:

I have a dataframe (df) that looks something like this:

Shape      Weight  Colour
Circle     5       Blue, Red
Square     7       Yellow, Red
Triangle   8       Blue, Yellow, Red
Rectangle  10      Green

I would like to label encode the "Colour" column so that the dataframe looks like this:

Shape      Weight  Blue  Red  Yellow  Green
Circle     5       1     1    0       0
Square     7       0     1    1       0
Triangle   8       1     1    1       0
Rectangle  10      0     0    0       1

Is there an easy function to do this type of conversion? Any pointers in the right direction would be appreciated. Thanks.

Asked By: ScottC


Answers:

Try:

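import pandas as pd

# Split each "Colour" string into a list, ignoring spaces around the commas
df["Colour"] = df["Colour"].str.split(r"\s*,\s*", regex=True)
# One row per (Shape, Colour) pair
x = df.explode("Colour")

# Count colours per shape and attach the counts to the original columns
df_out = (
    pd.concat(
        [df.set_index("Shape"), pd.crosstab(x["Shape"], x["Colour"])], axis=1
    )
    .reset_index()
    .drop(columns="Colour")
)
print(df_out)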

Prints:

       Shape  Weight  Blue  Green  Red  Yellow
0     Circle       5     1      0    1       0
1     Square       7     0      0    1       1
2   Triangle       8     1      0    1       1
3  Rectangle      10     0      1    0       0
Answered By: Andrej Kesely
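
For comparison, here is a shorter sketch that is not part of the answer above. It relies on `Series.str.get_dummies`, which splits each string on a separator and returns one 0/1 column per distinct value. The sample frame is reconstructed from the question, and normalising the separator with `str.replace` first is an assumption made here so that stray spaces around the commas do not produce duplicate columns.

```python
import pandas as pd

# Sample data reconstructed from the question (assumed to match the real df)
df = pd.DataFrame(
    {
        "Shape": ["Circle", "Square", "Triangle", "Rectangle"],
        "Weight": [5, 7, 8, 10],
        "Colour": ["Blue, Red", "Yellow, Red", "Blue, Yellow, Red", "Green"],
    }
)

# Collapse ", " (with any surrounding whitespace) to a bare comma,
# then build one indicator column per colour
dummies = (
    df["Colour"]
    .str.replace(r"\s*,\s*", ",", regex=True)
    .str.get_dummies(sep=",")
)

# Replace the original text column with the indicator columns
df_out = pd.concat([df.drop(columns="Colour"), dummies], axis=1)
print(df_out)
```

The indicator columns come out in alphabetical order (Blue, Green, Red, Yellow), the same order as in the crosstab output above. Unlike `crosstab`, this does not depend on "Shape" being unique, because the rows are aligned by position in the original index rather than by shape name.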