How to add your own categories into the OneHotEncoder

Question:

I have data in a format such as [‘1’, ‘5’, ‘6’, ‘5’, ‘2’], where each label is a character for a digit between 0 and 9. My data is nominal categorical, so it is unordered, and I use the scikit-learn OneHotEncoder to encode it. However, I run into an error when testing the model on, say, [‘1’, ‘5’, ‘9’, ‘3’, ‘1’] when no training input had ‘9’ at the third index.

I think this happens because, if the training data only has ‘0’-‘8’ at the third index, the fitted OneHotEncoder doesn’t recognize a ‘9’ at that index and throws an error. I’m wondering if there is a way to manually add these categories, so that the category exists in the ML model and simply has no data points in it.

Example:

from sklearn.preprocessing import OneHotEncoder

a = [['1'], ['2'], ['3'], ['5']]
enc = OneHotEncoder()
X = enc.fit_transform(a)
enc.transform([['4']])

You can see that my training data does not contain ‘4’, even though ‘4’ is a possible label. So when I fit the encoder and then transform ‘4’, it throws an error:

ValueError: Found unknown categories ['4'] in column 0 during transform

I’m wondering how I could manually add ‘4’ as a category.

Asked By: Dan8757


Answers:

There can be two cases here.

  1. If you know all the categories beforehand.

Pass all the possible categories when the OneHotEncoder is initialized. The categories parameter expects a list of lists, one inner list per input column.

# One inner list per column; the data here has a single column.
enc = OneHotEncoder(categories=[[str(i) for i in range(10)]])

  2. If you don’t know some categories beforehand.

# This argument is set to `error` by default, so an error is raised if an
# unknown category is encountered.
enc = OneHotEncoder(handle_unknown='ignore')

The documentation describes handle_unknown as follows: Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
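The two cases can be compared side by side with the data from the question. This is a minimal sketch, not a definitive recipe: the variable names (train, all_labels, enc_known, enc_ignore) are chosen here for illustration, and it assumes a single input column whose labels are the digit characters ‘0’ to ‘9’.

```python
from sklearn.preprocessing import OneHotEncoder

# Training data from the question: '4' never appears.
train = [['1'], ['2'], ['3'], ['5']]

# Case 1: declare every possible label up front.
# `categories` takes one list per input column, so a single
# column needs a list that contains one list.
all_labels = [str(i) for i in range(10)]
enc_known = OneHotEncoder(categories=[all_labels])
enc_known.fit(train)
# '4' has its own output column even though it was never seen in training.
row_known = enc_known.transform([['4']]).toarray()

# Case 2: unseen labels are encoded as a row of zeros.
enc_ignore = OneHotEncoder(handle_unknown='ignore')
enc_ignore.fit(train)
row_ignore = enc_ignore.transform([['4']]).toarray()

print(row_known.shape, row_ignore.shape)
```

With the first encoder the output always has ten columns, so a model trained on it has a (so far empty) slot for ‘4’. With the second, the output has only the four columns seen during fit, and ‘4’ is indistinguishable from any other unseen label.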

This case is also asked here

Refer here for detailed documentation about every parameter.

Answered By: ranka47

I ran into a similar problem. My solution was:

enc = OneHotEncoder(drop='first')
trans = enc.fit_transform(X)  # X is an array of shape (n, m)
print(enc.categories_)

You can then inspect the printed categories and set the categories parameter accordingly, adding any labels that are missing.
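One possible reading of this suggestion, sketched under assumptions: fit once, read the learned labels from categories_, add the missing ones by hand, and build a second encoder from the completed lists. The names learned, full and enc_full are illustrative, the missing label ‘4’ is taken from the question, and drop='first' is left out so that the output columns map one-to-one onto the labels.

```python
from sklearn.preprocessing import OneHotEncoder

X = [['1'], ['2'], ['3'], ['5']]

# First pass: let the encoder discover which labels are present.
enc = OneHotEncoder()
enc.fit(X)
learned = [[str(c) for c in col] for col in enc.categories_]

# Add the label known to be possible but absent from X (assumed: '4').
full = [sorted(set(col) | {'4'}) for col in learned]

# Second pass: rebuild the encoder with the completed category lists.
enc_full = OneHotEncoder(categories=full)
enc_full.fit(X)
encoded = enc_full.transform([['4']]).toarray()

print(full)
print(encoded)
```

This only helps when you can tell from the printed categories which labels are missing; if the full label set is known in advance, passing it directly to categories is simpler.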

Answered By: Junming Liang