Target encoding train and test datasets with many categorical columns

Question

I am trying to prepare a training dataset that contains many high-cardinality categorical columns in order to train a machine learning model. I therefore want to target encode them, converting the categorical columns into numerical ones. Label encoding is not suitable because the categorical features are not ordinal.

My train dataset looks like this (I show only 4 of the 20 columns):

target	cat_col1	cat_col2	cat_col3	cat_col4
10	city1	james	25-55	abc
20	city2	adam	30-40	bcc
15	city1	charles	30-40	bcc

I want to write efficient code that target encodes all the categorical columns without handling each column individually.

The resulting training dataframe should look like this, with each category replaced by the mean target for that category:

target	cat_col1	cat_col2	cat_col3	cat_col4
10	12.5	10	10	10
20	20	20	17.5	17.5
15	12.5	15	17.5	17.5

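I can get the above output by writing code for each column, but since I have 20 categorical columns, this does not seem efficient.

# Assumes the category_encoders implementation of TargetEncoder
from category_encoders import TargetEncoder

encoder = TargetEncoder()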
train['cat_col1'] = encoder.fit_transform(train['cat_col1'], train['target'])
train['cat_col2'] = encoder.fit_transform(train['cat_col2'], train['target'])
train['cat_col3'] = encoder.fit_transform(train['cat_col3'], train['target'])
train['cat_col4'] = encoder.fit_transform(train['cat_col4'], train['target'])

In addition, I would like to take the target-encoded values learned from the train dataframe and use them to replace the categories in the test dataframe.

Asked By: Datalearner

Answer 1

Assuming you’re using the category_encoders implementation, it should accept several columns at once, at least in recent versions:

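from category_encoders import TargetEncoder

cat_cols = ['cat_col1', 'cat_col2', 'cat_col3', 'cat_col4']

encoder = TargetEncoder()
# Fit on train only, then reuse the learned encodings on test
train[cat_cols] = encoder.fit_transform(train[cat_cols], train['target'])
test[cat_cols] = encoder.transform(test[cat_cols])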

Alternatively, you could use a loop:

for column in cat_cols:
    encoder = TargetEncoder()
    train[column] = encoder.fit_transform(train[column], train['target'])
    test[column] = encoder.transform(test[column])
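
To see what "fit on train, transform test" amounts to, here is a minimal pandas-only sketch of plain mean target encoding. It is not what category_encoders does internally: as far as I know that encoder also applies smoothing by default, so its numbers will generally differ from raw per-category means. The test frame below and the choice to fall back to the global train mean for unseen categories are assumptions made for illustration.

```python
import pandas as pd

# Train data from the question.
train = pd.DataFrame({
    "target":   [10, 20, 15],
    "cat_col1": ["city1", "city2", "city1"],
    "cat_col2": ["james", "adam", "charles"],
    "cat_col3": ["25-55", "30-40", "30-40"],
    "cat_col4": ["abc", "bcc", "bcc"],
})

# Hypothetical test data; "city3" never appears in train.
test = pd.DataFrame({
    "cat_col1": ["city2", "city3"],
    "cat_col2": ["adam", "james"],
    "cat_col3": ["30-40", "25-55"],
    "cat_col4": ["bcc", "abc"],
})

cat_cols = ["cat_col1", "cat_col2", "cat_col3", "cat_col4"]

# Fallback for categories that were not seen in train (an assumption).
global_mean = train["target"].mean()

# Learn one {category: mean target} mapping per column, from train only.
mappings = {col: train.groupby(col)["target"].mean() for col in cat_cols}

train_enc = train.copy()
test_enc = test.copy()
for col in cat_cols:
    # Replace each category with the mean target learned for it.
    train_enc[col] = train[col].map(mappings[col])
    # Reuse the train mappings on test; unseen categories get the global mean.
    test_enc[col] = test[col].map(mappings[col]).fillna(global_mean)

print(train_enc)
print(test_enc)
```

Keeping the mappings in a dictionary means the same encodings can be applied later to any new dataframe with the same columns, which is the role the fitted encoder object plays in the snippets above.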

Answered By: dx2-66
