Using OrdinalEncoder to transform categorical values

Question:

I have a dataset which has the following columns:

No  Name  Sex  Blood  Grade  Height  Study
1   Tom   M    O      56     160     Math
2   Harry M    A      76     192     Math
3   John  M    A      45     178     English
4   Nancy F    B      78     157     Biology
5   Mike  M    O      79     167     Math
6   Kate  F    AB     66     156     English
7   Mary  F    O      99     166     Science

I want to change it to be something like this:

No  Name  Sex  Blood  Grade  Height  Study
1   Tom   0    0      56     160     0
2   Harry 0    1      76     192     0
3   John  0    1      45     178     1
4   Nancy 1    2      78     157     2
5   Mike  0    0      79     167     0
6   Kate  1    3      66     156     1
7   Mary  1    0      99     166     3

I know there is a library that can do this:

from sklearn.preprocessing import OrdinalEncoder

I tried the following, but it did not work:

enc = OrdinalEncoder()
enc.fit(df[["Sex","Blood", "Study"]])

Can anyone help me find what I am doing wrong and how to do it?

Asked By: asmgx

Answers:

You were almost there!

The fit method only prepares the encoder (it learns the mapping from your data); it does not transform the data.

You have to call transform to actually transform the data, or use fit_transform, which fits and transforms the same data in one step.

enc = OrdinalEncoder()
enc.fit(df[["Sex","Blood", "Study"]])
df[["Sex","Blood", "Study"]] = enc.transform(df[["Sex","Blood", "Study"]])

or directly

enc = OrdinalEncoder()
df[["Sex","Blood", "Study"]] = enc.fit_transform(df[["Sex","Blood", "Study"]])

Note: The values won’t be the ones you provided, because fit sorts the unique values of each column (as numpy.unique does), so the codes follow alphabetical order rather than order of appearance.

You can see this in enc.categories_:

[array(['F', 'M'], dtype=object),
 array(['A', 'AB', 'B', 'O'], dtype=object),
 array(['Biology', 'English', 'Math', 'Science'], dtype=object)]

Each value in an array is encoded by its position
(F will be encoded as 0, M as 1).
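
If you do want the order-of-appearance codes from the question, one option is to pass the category order explicitly. The following is a minimal sketch, not the only way to do it: it rebuilds the question's table by hand (the DataFrame literal is an assumption based on the question), derives each column's order with pd.unique, and uses the encoder's categories and dtype parameters to get integer codes.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# The question's data, rebuilt by hand for illustration
df = pd.DataFrame({
    "Name": ["Tom", "Harry", "John", "Nancy", "Mike", "Kate", "Mary"],
    "Sex": ["M", "M", "M", "F", "M", "F", "F"],
    "Blood": ["O", "A", "A", "B", "O", "AB", "O"],
    "Study": ["Math", "Math", "English", "Biology", "Math", "English", "Science"],
})
cols = ["Sex", "Blood", "Study"]

# One list of categories per column, in order of first appearance
appearance_order = [list(pd.unique(df[c])) for c in cols]

# categories fixes the code assigned to each value; dtype=int avoids float codes
enc = OrdinalEncoder(categories=appearance_order, dtype=int)
df[cols] = enc.fit_transform(df[cols])
print(df)
```

With this ordering M maps to 0 and F to 1, and Math, English, Biology, Science map to 0, 1, 2, 3.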

Answered By: abcdaire

I think it is important to point out that this is not an example of ordinal encoding. Sex, Blood and Study do not have an ordinal scale (nor did the person who asked the question suggest one). Ordinal data has a ranking (see e.g. https://en.wikipedia.org/wiki/Ordinal_data); the variables here do not.

If your variable is a target variable, you can use LabelEncoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).

Then you can do something like:

from sklearn.preprocessing import LabelEncoder

for col in ["Sex","Blood", "Study"]:
    df[col] = LabelEncoder().fit_transform(df[col])

If your variables are features, you should use OrdinalEncoder for this (see the comments on my answer).

The name OrdinalEncoder is quite unfortunate, as "ordinal" is meant in the mathematical sense, not the statistical one.

More on the difference between OrdinalEncoder and LabelEncoder in sklearn: https://datascience.stackexchange.com/questions/39317/difference-between-ordinalencoder-and-labelencoder
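
The practical difference shows up in the input shapes. This is a small sketch on hypothetical toy data (the arrays X and y are made up for illustration): OrdinalEncoder takes a 2D feature matrix, while LabelEncoder takes a 1D target vector.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical toy data: one feature column and one target vector
X = np.array([["A"], ["AB"], ["O"], ["A"]], dtype=object)  # 2D: (n_samples, n_features)
y = np.array(["pass", "fail", "pass", "pass"])             # 1D: (n_samples,)

X_codes = OrdinalEncoder().fit_transform(X)  # float codes, shape (4, 1)
y_codes = LabelEncoder().fit_transform(y)    # integer codes, shape (4,)

print(X_codes.ravel())  # codes follow the sorted categories A, AB, O
print(y_codes)          # codes follow the sorted labels fail, pass
```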

Answered By: Createdd

Here is a simple example of encoding a DataFrame column as integers with sklearn. Note that it uses LabelEncoder rather than OrdinalEncoder.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame(
    {
        "gender": ["man", "women", "child", "man", "women", "child"],
        "age": [40, 40, 10, 50, 50, 8],
    }
)


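def ordinal_encoding(genders):
    # Learn the sorted unique labels, then map each value to its integer code
    le = LabelEncoder()
    le.fit(genders)
    return le.transform(genders)


# "child" -> 0, "man" -> 1, "women" -> 2
encoded_genders = ordinal_encoding(df["gender"])

Answered By: Muhammad Faizan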

Here is my opinion:

First create the encoder:

enc = OrdinalEncoder()

The columns whose values need to be transformed are:

Sex, Blood, Study

Use enc.fit_transform() to fit and then transform the values of each column to numbers as shown below:

X_enc = enc.fit_transform(df[["Sex", "Blood", "Study"]])

Finally, replace the original values in the main dataframe with the transformed ones:

df[["Sex", "Blood", "Study"]] = pd.DataFrame(X_enc, columns=["Sex", "Blood", "Study"], index=df.index)

The result:

No  Name   Sex  Blood  Grade  Height  Study
1   Tom    1.0  3.0    56     160     2.0
2   Harry  1.0  0.0    76     192     2.0
3   John   1.0  0.0    45     178     1.0
4   Nancy  0.0  2.0    78     157     0.0
5   Mike   1.0  3.0    79     167     2.0
6   Kate   0.0  1.0    66     156     1.0
7   Mary   0.0  3.0    99     166     3.0

Answered By: Saber Vatankhah

@Createdd is right. "Sex", "Blood" and "Study" are all categorical attributes, but there are two kinds of categorical attributes: ordinal and nominal.

If you use OrdinalEncoder for a nominal attribute, many machine learning models will treat the codes as ordered, e.g. Biology (0) < English (1) < Math (2) < Science (3). In reality this is not the case: "English" is not between "Biology" and "Math", nor is there any other order. A real ordinal attribute would be something like a rating: "Very Bad" (0), "Bad" (1), "Neutral" (2), "Good" (3), "Very Good" (4).
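
For a genuinely ordinal attribute like that rating, the ranking can be handed to the encoder so the codes respect it. A minimal sketch, assuming a hypothetical "rating" column (the data here is invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ratings column, for illustration only
ratings = pd.DataFrame({"rating": ["Good", "Very Bad", "Neutral", "Very Good", "Bad"]})

# The ranking from worst to best; without it the encoder would sort alphabetically
scale = ["Very Bad", "Bad", "Neutral", "Good", "Very Good"]

enc = OrdinalEncoder(categories=[scale])  # one category list per column
codes = enc.fit_transform(ratings[["rating"]]).ravel()
print(codes)  # "Very Bad" is 0 and "Very Good" is 4
```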

The correct answer should be to use OneHotEncoder for the "Sex", "Blood", "Study" attributes (because they are nominal attributes).
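
A rough sketch of what that could look like on the question's three columns (the DataFrame literal and the "column_category" naming scheme are assumptions made for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# The three nominal columns from the question, rebuilt by hand
df = pd.DataFrame({
    "Sex": ["M", "M", "M", "F", "M", "F", "F"],
    "Blood": ["O", "A", "A", "B", "O", "AB", "O"],
    "Study": ["Math", "Math", "English", "Biology", "Math", "English", "Science"],
})
cols = ["Sex", "Blood", "Study"]

enc = OneHotEncoder()
# The encoder returns a sparse matrix by default; convert it to a dense array
onehot = enc.fit_transform(df[cols]).toarray()

# Build readable column names such as "Sex_F" from the learned categories
names = [f"{col}_{cat}" for col, cats in zip(cols, enc.categories_) for cat in cats]
onehot_df = pd.DataFrame(onehot, columns=names, index=df.index)
print(onehot_df)
```

Each row then has exactly one 1 per original column, so no artificial order is implied between categories.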

Answered By: Erik Varga