How to label-encode comma-separated text in a DataFrame column in Python?
Question:
I have a dataframe (df) that looks something like this:
Shape | Weight | Colour |
---|---|---|
Circle | 5 | Blue, Red |
Square | 7 | Yellow, Red |
Triangle | 8 | Blue, Yellow, Red |
Rectangle | 10 | Green |
I would like to label encode the "Colour" column so that the dataframe looks like this:
Shape | Weight | Blue | Red | Yellow | Green |
---|---|---|---|---|---|
Circle | 5 | 1 | 1 | 0 | 0 |
Square | 7 | 0 | 1 | 1 | 0 |
Triangle | 8 | 1 | 1 | 1 | 0 |
Rectangle | 10 | 0 | 0 | 0 | 1 |
Is there an easy function to do this type of conversion? Any pointers in the right direction would be appreciated. Thanks.
Answers:
Try:
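import pandas as pd

# Split each "Colour" string into a list, tolerating spaces around the commas.
df["Colour"] = df["Colour"].str.split(r"\s*,\s*", regex=True)
# One row per (Shape, Colour) pair.
x = df.explode("Colour")
# Count colours per shape and join the counts back onto the original rows.
df_out = (
    pd.concat(
        [df.set_index("Shape"), pd.crosstab(x["Shape"], x["Colour"])], axis=1
    )
    .reset_index()
    .drop(columns="Colour")
)
print(df_out)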
Prints:
       Shape  Weight  Blue  Green  Red  Yellow
0     Circle       5     1      0    1       0
1     Square       7     0      0    1       1
2   Triangle       8     1      0    1       1
3  Rectangle      10     0      1    0       0
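As a possible alternative (a sketch, not part of the answer above), pandas also has `Series.str.get_dummies`, which builds 0/1 indicator columns directly from delimited strings. The sample frame below is reconstructed from the table in the question, and the whitespace normalisation step is my own assumption about how messy the separators might be:

```python
import pandas as pd

# Sample data rebuilt from the question's table.
df = pd.DataFrame(
    {
        "Shape": ["Circle", "Square", "Triangle", "Rectangle"],
        "Weight": [5, 7, 8, 10],
        "Colour": ["Blue, Red", "Yellow, Red", "Blue, Yellow, Red", "Green"],
    }
)

# Normalise ", " / " ," / "," to a bare comma so one separator covers all cases.
colours = df["Colour"].str.replace(r"\s*,\s*", ",", regex=True)

# One 0/1 indicator column per distinct colour (columns come out sorted).
dummies = colours.str.get_dummies(sep=",")

# Swap the original text column for the indicator columns.
df_out = pd.concat([df.drop(columns="Colour"), dummies], axis=1)
print(df_out)
```

This avoids the explode/crosstab round trip and does not depend on "Shape" being unique, since the indicator columns are aligned on the row index. The colour columns should come out in alphabetical order, as in the output above.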