Ordinal encoding in Pandas
Question:
Is there a way to have pandas.get_dummies output the numerical representation in one column rather than a separate column for each option?
Concretely, pandas.get_dummies currently gives me a column for every option:
Size | Size_Big | Size_Medium | Size_Small |
---|---|---|---|
Big | 1 | 0 | 0 |
Medium | 0 | 1 | 0 |
Small | 0 | 0 | 1 |
But I’m looking for output more like the following:
Size | Size_Numerical |
---|---|
Big | 1 |
Medium | 2 |
Small | 3 |
Answers:
If using Pandas isn’t an absolute requirement, scikit-learn’s OrdinalEncoder does exactly that. Note that it returns 0-based float codes and, by default, orders string categories alphabetically.
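A minimal sketch of the OrdinalEncoder approach, assuming a DataFrame with a Size column like the one in the question (the sample data and the explicit category order are my own assumptions, not from the answer):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed sample data, modelled on the question
df = pd.DataFrame({"Size": ["Big", "Medium", "Small", "Medium"]})

# Passing categories fixes the order explicitly; without it the encoder
# sorts the unique values (alphabetically for strings).
encoder = OrdinalEncoder(categories=[["Big", "Medium", "Small"]])

# fit_transform expects 2-D input and returns a 2-D float array of 0-based codes.
codes = encoder.fit_transform(df[["Size"]])

# Flatten, cast to int, and shift by one to get the 1-based column from the question.
df["Size_Numerical"] = codes.ravel().astype(int) + 1
print(df)
```

Giving the order explicitly matters when the alphabetical order of the labels is not the order you want the numbers to follow.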
I think one-hot encoding has a similar issue: it expands each label into its own dimension. Use LabelEncoder instead:
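import pandas as pd
from sklearn import preprocessing

# Sample data matching the output shown below
df = pd.DataFrame({'Sizes': ['Small', 'Medium', 'Large']})
le = preprocessing.LabelEncoder()
# Codes are 0-based in sorted label order, so add 1
df['Category'] = le.fit_transform(df['Sizes']) + 1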
Outputs:
Sizes Category
0 Small 3
1 Medium 2
2 Large 1
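To see why Small ends up as 3 and Large as 1, it can help to inspect the fitted encoder. The sketch below assumes the same three-row frame as the output above; the mapping and restored names are only illustrative:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Sizes": ["Small", "Medium", "Large"]})

le = LabelEncoder()
df["Category"] = le.fit_transform(df["Sizes"]) + 1

# classes_ holds the sorted unique labels; position i maps to code i (before the +1).
mapping = {label: code + 1 for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform expects the original 0-based codes, so undo the +1 first.
restored = le.inverse_transform(df["Category"].to_numpy() - 1)
print(list(restored))
```

The numbering comes from the sorted labels, not from any notion of size, so it only matches a meaningful order by coincidence.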
You don’t want dummies, you want factors/categories.
Use pandas.factorize:
df['Size_Numerical'] = pd.factorize(df['Size'])[0] + 1
output:
Size Size_Numerical
0 Big 1
1 Medium 2
2 Small 3
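One thing to watch: pd.factorize numbers values in order of first appearance, so the result above depends on Big coming first in the data. A small sketch with assumed, shuffled sample data shows the difference between the default and sort=True:

```python
import pandas as pd

# Assumed sample data in a deliberately shuffled order
df = pd.DataFrame({"Size": ["Small", "Big", "Medium", "Big"]})

# Default: codes follow order of first appearance (Small=0, Big=1, Medium=2).
codes, uniques = pd.factorize(df["Size"])
df["By_Appearance"] = codes + 1

# sort=True: codes follow the sorted unique values (Big=0, Medium=1, Small=2).
sorted_codes, sorted_uniques = pd.factorize(df["Size"], sort=True)
df["By_Sorted"] = sorted_codes + 1
print(df)
```

If the numbers need to follow a specific order that is neither of these, an explicit category list is the safer route.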
With the category dtype, you could do:
(
    dataf
    .astype({"Size": "category"})
    .assign(
        # Result is still categorical; append .astype(int) for plain integers
        Size_Numerical=lambda d: d["Size"].cat.rename_categories(
            {"Big": 1, "Medium": 2, "Small": 3}
        )
    )
)
Tested with this data:
import pandas as pd
dataf = pd.DataFrame({'Size':["Big", "Medium", "Small","Medium"]})
You can convert it to the Categorical type and get its codes:
pd.Categorical(['A', 'B', 'C', 'A', 'C']).codes
Output:
array([0, 1, 2, 0, 2], dtype=int8)
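Applied to the Size column from the question, this could look like the sketch below. The size_order list is a hypothetical ordering picked for illustration; the codes are 0-based, so 1 is added to match the requested output:

```python
import pandas as pd

dataf = pd.DataFrame({"Size": ["Big", "Medium", "Small", "Medium"]})

# Hypothetical ordering chosen for illustration: list the categories in the
# order that should map to 1, 2, 3.
size_order = ["Big", "Medium", "Small"]

cat = pd.Categorical(dataf["Size"], categories=size_order, ordered=True)

# .codes is 0-based (and -1 for values missing from size_order), so add 1.
dataf["Size_Numerical"] = cat.codes + 1
print(dataf)
```

Values that are not in size_order get code -1, which becomes 0 after the shift, so unexpected labels are easy to spot.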