Using OrdinalEncoder to transform categorical values
Question:
I have a dataset which has the following columns:
No Name Sex Blood Grade Height Study
1 Tom M O 56 160 Math
2 Harry M A 76 192 Math
3 John M A 45 178 English
4 Nancy F B 78 157 Biology
5 Mike M O 79 167 Math
6 Kate F AB 66 156 English
7 Mary F O 99 166 Science
I want to change it to be something like this:
No Name Sex Blood Grade Height Study
1 Tom 0 0 56 160 0
2 Harry 0 1 76 192 0
3 John 0 1 45 178 1
4 Nancy 1 2 78 157 2
5 Mike 0 0 79 167 0
6 Kate 1 3 66 156 1
7 Mary 1 0 99 166 3
I know there is a library that can do it:
from sklearn.preprocessing import OrdinalEncoder
I've tried this, but it did not work:
enc = OrdinalEncoder()
enc.fit(df[["Sex","Blood", "Study"]])
Can anyone help me find what I am doing wrong and how to do it?
Answers:
You were almost there!
The fit method only prepares the encoder (it learns the mapping from your data); it does not transform the data.
You have to call transform to transform the data, or use fit_transform, which fits and transforms the same data in one step.
enc = OrdinalEncoder()
enc.fit(df[["Sex","Blood", "Study"]])
df[["Sex","Blood", "Study"]] = enc.transform(df[["Sex","Blood", "Study"]])
or directly
enc = OrdinalEncoder()
df[["Sex","Blood", "Study"]] = enc.fit_transform(df[["Sex","Blood", "Study"]])
Note: The values won't be the ones you provided, because the fit method internally uses numpy.unique, which returns the categories sorted alphabetically rather than in order of appearance.
You can see this in enc.categories_:
[array(['F', 'M'], dtype=object),
array(['A', 'AB', 'B', 'O'], dtype=object),
array(['Biology', 'English', 'Math', 'Science'], dtype=object)]
Each value in an array is encoded by its position (F will be encoded as 0, M as 1).
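If the codes need to follow a specific order rather than the alphabetical one, the categories parameter of OrdinalEncoder accepts one list per column. The following is a minimal sketch, assuming the DataFrame from the question is rebuilt by hand and that the intended mapping is M as 0, F as 1, blood types in the order O, A, B, AB, and subjects in the order Math, English, Biology, Science:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Sample data rebuilt from the question (only the relevant columns).
df = pd.DataFrame({
    "Name": ["Tom", "Harry", "John", "Nancy", "Mike", "Kate", "Mary"],
    "Sex": ["M", "M", "M", "F", "M", "F", "F"],
    "Blood": ["O", "A", "A", "B", "O", "AB", "O"],
    "Study": ["Math", "Math", "English", "Biology", "Math", "English", "Science"],
})

cols = ["Sex", "Blood", "Study"]

# One list per column, in the same order as `cols`.
# The position of a value in its list becomes its code.
enc = OrdinalEncoder(categories=[
    ["M", "F"],
    ["O", "A", "B", "AB"],
    ["Math", "English", "Biology", "Science"],
])

# fit_transform returns floats by default; cast to int for readability.
encoded = enc.fit_transform(df[cols]).astype(int)
df[cols] = encoded
print(df)
```

With this ordering, Tom (M, O, Math) is encoded as 0, 0, 0 and Kate (F, AB, English) as 1, 3, 1.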
I think it is important to point out that this is not an example of ordinal encoding. Sex, Blood and Study do not have an ordinal scale (and the person who asked the question did not suggest they do). Ordinal data has a ranking (see e.g. https://en.wikipedia.org/wiki/Ordinal_data). The examples here do not have a ranking.
If your variable is a target variable, you can use the LabelEncoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
Then you can do something like:
from sklearn.preprocessing import LabelEncoder
for col in ["Sex","Blood", "Study"]:
    df[col] = LabelEncoder().fit_transform(df[col])
If your variables are features, you should use the OrdinalEncoder to accomplish this. (See comments to my answer).
The naming of the OrdinalEncoder is quite unfortunate, as "ordinal" is meant in a mathematical sense and not in a statistical one.
More on the difference between OrdinalEncoder and LabelEncoder in sklearn: https://datascience.stackexchange.com/questions/39317/difference-between-ordinalencoder-and-labelencoder
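To make the practical difference concrete, here is a small sketch comparing the two encoders on the Blood values from the question (the variable names are illustrative). Both assign codes by sorted position; what differs is the input shape they expect and the dtype they return by default:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

blood = np.array(["O", "A", "A", "B", "O", "AB", "O"])

# LabelEncoder expects a 1-D array (it is meant for target labels)
# and returns integer codes.
label_codes = LabelEncoder().fit_transform(blood)

# OrdinalEncoder expects a 2-D array of shape (n_samples, n_features)
# and returns float codes by default.
ordinal_codes = OrdinalEncoder().fit_transform(blood.reshape(-1, 1))

print(label_codes)            # 1-D integer array
print(ordinal_codes.ravel())  # same codes, stored as floats
```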
Here is a simple example of encoding a DataFrame column with sklearn (note that it uses LabelEncoder on a single column).
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame(
    {
        "gender": ["man", "women", "child", "man", "women", "child"],
        "age": [40, 40, 10, 50, 50, 8],
    }
)
def ordinal_encoding(genders):
    # Learn the mapping, then return the integer codes for the same values.
    le = LabelEncoder()
    le.fit(genders)
    return le.transform(genders)
encoded_genders = ordinal_encoding(df["gender"])
Here is my opinion:
First create the encoder:
enc = OrdinalEncoder()
The columns whose values need to be transformed are:
Sex, Blood, Study
Use enc.fit_transform() to fit the encoder and transform the values of each column to numbers (note the double brackets, which select a list of columns):
X_enc = enc.fit_transform(df[["Sex", "Blood", "Study"]])
Finally, replace the original values in the main dataframe with the transformed ones:
df[["Sex", "Blood", "Study"]] = X_enc
The result:
No Name Sex Blood Grade Height Study
1 Tom 1.0 3.0 56 160 2.0
2 Harry 1.0 0.0 76 192 2.0
3 John 1.0 0.0 45 178 1.0
4 Nancy 0.0 2.0 78 157 0.0
5 Mike 1.0 3.0 79 167 2.0
6 Kate 0.0 1.0 66 156 1.0
7 Mary 0.0 3.0 99 166 3.0
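The codes in the table are floats because OrdinalEncoder returns float64 by default. If integer codes are preferred, the dtype parameter can be set when creating the encoder. A minimal sketch, assuming the question's data is rebuilt by hand; it also uses inverse_transform to show that the original labels can be recovered:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Sample data rebuilt from the question (only the categorical columns).
df = pd.DataFrame({
    "Sex": ["M", "M", "M", "F", "M", "F", "F"],
    "Blood": ["O", "A", "A", "B", "O", "AB", "O"],
    "Study": ["Math", "Math", "English", "Biology", "Math", "English", "Science"],
})
cols = ["Sex", "Blood", "Study"]

# dtype=int makes the encoder return integer codes instead of floats.
enc = OrdinalEncoder(dtype=int)
codes = enc.fit_transform(df[cols])

# The fitted encoder can map the codes back to the original labels.
restored = enc.inverse_transform(codes)

df[cols] = codes
print(df)
```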
@Createdd is right. Even though "Sex", "Blood" and "Study" are categorical attributes, there are 2 kinds of categorical attributes: ordinal and nominal.
If you use OrdinalEncoder for a nominal attribute, most machine learning models will make the following assumption: Math (0) < English (1) < Biology (2) < Science (3). In reality this should not be the case: "English" is not between "Math" and "Biology", nor do the subjects have any other order. A real ordinal attribute would be something like a rating: "Very Bad" (0), "Bad" (1), "Neutral" (2), "Good" (3), "Very Good" (4).
The correct answer should be to use OneHotEncoder for the "Sex", "Blood", "Study" attributes (because they are nominal attributes).