How to handle a category mismatch after OneHotEncoding the test data when predicting?

Question:

I’m sorry if the title of the question is not that clear; I could not sum up the problem in one line.

Here are simplified datasets to explain the problem. The training set contains many more categories than the test set, so after OneHotEncoding the two sets end up with different numbers of columns. How can I handle this problem?

Training Set

+-------+----------+
| Value | Category |
+-------+----------+
| 100   | SE1      |
+-------+----------+
| 200   | SE2      |
+-------+----------+
| 300   | SE3      |
+-------+----------+

Training set after OneHotEncoding

+-------+-----------+-----------+-----------+
| Value | DummyCat1 | DummyCat2 | DummyCat3 |
+-------+-----------+-----------+-----------+
| 100   | 1         | 0         | 0         |
+-------+-----------+-----------+-----------+
| 200   | 0         | 1         | 0         |
+-------+-----------+-----------+-----------+
| 300   | 0         | 0         | 1         |
+-------+-----------+-----------+-----------+

Test Set

+-------+----------+
| Value | Category |
+-------+----------+
| 100   | SE1      |
+-------+----------+
| 200   | SE1      |
+-------+----------+
| 300   | SE2      |
+-------+----------+

Test set after OneHotEncoding

+-------+-----------+-----------+
| Value | DummyCat1 | DummyCat2 |
+-------+-----------+-----------+
| 100   | 1         | 0         |
+-------+-----------+-----------+
| 200   | 1         | 0         |
+-------+-----------+-----------+
| 300   | 0         | 1         |
+-------+-----------+-----------+

As you can see, the training set after OneHotEncoding has shape (3,4), while the test set after OneHotEncoding has shape (3,3).
Because of this, when I run the following code (y_train is a vector of shape (3,)):

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)

x_pred = regressor.predict(x_test)

I get the error at the predict call. The dimensions in the traceback come from my real dataset, which is why they are much larger than in the simplified example above.

  Traceback (most recent call last):

  File "<ipython-input-2-5bac76b24742>", line 30, in <module>
    x_pred = regressor.predict(x_test)

  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/base.py", line 256, in predict
    return self._decision_function(X)

  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/base.py", line 241, in _decision_function
    dense_output=True) + self.intercept_

  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/extmath.py", line 140, in safe_sparse_dot
    return np.dot(a, b)

ValueError: shapes (4801,2236) and (4033,) not aligned: 2236 (dim 1) != 4033 (dim 0)
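
A minimal sketch that reproduces the column mismatch described above, assuming each set is one-hot encoded with its own fit_transform call (the toy DataFrames here are constructed just for illustration):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'Category': ['SE1', 'SE2', 'SE3']})
test = pd.DataFrame({'Category': ['SE1', 'SE1', 'SE2']})

x_train = OneHotEncoder().fit_transform(train).toarray()   # 3 dummy columns
x_test = OneHotEncoder().fit_transform(test).toarray()     # only 2 dummy columns
print(x_train.shape, x_test.shape)                         # (3, 3) (3, 2)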
Asked By: Parthapratim Neog


Answers:

You have to transform x_test the same way x_train was transformed:

x_test = onehotencoder.transform(x_test)
x_pred = regressor.predict(x_test)

Make sure to use the same onehotencoder object that was used to fit() on x_train.

I’m assuming that you are currently calling fit_transform() on the test data.
Calling fit() or fit_transform() discards the previously learnt categories and re-fits the OneHotEncoder. It will then think there are only two distinct values in the column and change the shape of the output accordingly.
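
For reference, here is a minimal, self-contained sketch of that pattern (the toy data and variable names are assumptions for illustration; the test categories are a subset of the train categories, as in the question):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

x_train = pd.DataFrame({'Category': ['SE1', 'SE2', 'SE3']})
y_train = [100, 200, 300]
x_test = pd.DataFrame({'Category': ['SE1', 'SE1', 'SE2']})

onehotencoder = OneHotEncoder()
x_train_enc = onehotencoder.fit_transform(x_train)   # fit + transform on train only
x_test_enc = onehotencoder.transform(x_test)         # same encoder, transform only -> same columns

regressor = LinearRegression()
regressor.fit(x_train_enc, y_train)
x_pred = regressor.predict(x_test_enc)               # shapes now align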

Answered By: Vivek Kumar

There are two cases:

i) a train feature/column has more categories than the corresponding test column

ii) a test feature/column has more categories than the corresponding train column

In either case, the test data should only be transformed with the already-fitted encoder, never fit and transformed.

The general case of OHE usage:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

onehotencoder = OneHotEncoder()
# fit on train and transform; keep the index so join() aligns rows
enc_data_train = pd.DataFrame(onehotencoder.fit_transform(X_train[cat_columns]).toarray(),
                              index=X_train.index)
X_train = X_train[num_columns].join(enc_data_train)
# only transform the test set, with the encoder already fitted on train
enc_data_test = pd.DataFrame(onehotencoder.transform(X_test[cat_columns]).toarray(),
                             index=X_test.index)
X_test = X_test[num_columns].join(enc_data_test)

Here cat_columns is the list of categorical columns and num_columns is the list of numerical columns.
You never fit on X_test. The following code is wrong:

X_test = onehotencoder.fit_transform(X_test[cat_columns]).toarray()

This is not how test data should be encoded.

Now coming to the mismatch problem: train and test have different numbers of categories, and therefore different numbers of columns after encoding.

Two ways to solve it:

i) fit the encoder on the entire data (train & test) and only transform X_train and X_test

ii) ignore categories that appear only in the test set

Example code for i):

from sklearn.model_selection import train_test_split

# fit the encoder on the full data so every category is seen
onehotencoder = OneHotEncoder()
onehotencoder.fit(X[cat_columns])

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0,
                                                    train_size=0.75)

# only transform train and test; keep the index so join() aligns rows
enc_data_train = pd.DataFrame(onehotencoder.transform(X_train[cat_columns]).toarray(),
                              index=X_train.index)
X_train = X_train[num_columns].join(enc_data_train)

enc_data_test = pd.DataFrame(onehotencoder.transform(X_test[cat_columns]).toarray(),
                             index=X_test.index)
X_test = X_test[num_columns].join(enc_data_test)

Example code for ii), using handle_unknown='ignore' and fitting on train only:

# unseen categories in the test set are encoded as all-zero rows
onehotencoder = OneHotEncoder(handle_unknown='ignore')

enc_data_train = pd.DataFrame(onehotencoder.fit_transform(X_train[cat_columns]).toarray(),
                              index=X_train.index)
X_train = X_train[num_columns].join(enc_data_train)

enc_data_test = pd.DataFrame(onehotencoder.transform(X_test[cat_columns]).toarray(),
                             index=X_test.index)
X_test = X_test[num_columns].join(enc_data_test)

This second way ignores categories that appear only in the test set. The number of columns will be the same in train and test. The method assumes that the unseen test categories are not very significant; even if such a category were significant, it could not influence the model anyway, because it was never present in the training set.
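
To make the behaviour concrete, here is a small sketch (the toy data, including the unseen category SE4, is an assumption for illustration):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'Category': ['SE1', 'SE2', 'SE3']})
test = pd.DataFrame({'Category': ['SE1', 'SE4', 'SE2']})   # SE4 never appears in train

enc = OneHotEncoder(handle_unknown='ignore')
train_enc = enc.fit_transform(train).toarray()   # shape (3, 3)
test_enc = enc.transform(test).toarray()         # shape (3, 3); the SE4 row is all zeros

print(test_enc)
# [[1. 0. 0.]
#  [0. 0. 0.]
#  [0. 1. 0.]]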

Answered By: Vaishnavi S