Target encoding train and test datasets with many categorical columns
Question:
I am trying to prepare a training dataset that contains many high-cardinality categorical columns in order to train a machine learning model. I therefore want to target encode them, converting the categorical columns into numerical columns. Label encoding is not suitable because the categorical features are not ordinal.
My train dataset looks like this (I have included only 4 of the 20 columns):
target | cat_col1 | cat_col2 | cat_col3 | cat_col4 |
---|---|---|---|---|
10 | city1 | james | 25-55 | abc |
20 | city2 | adam | 30-40 | bcc |
15 | city1 | charles | 30-40 | bcc |
I want to write efficient code that target encodes all the categorical columns without having to handle each column individually.
The resulting training dataframe should look like
target | cat_col1 | cat_col2 | cat_col3 | cat_col4 |
---|---|---|---|---|
10 | 15 | 10 | 10 | 10 |
20 | 20 | 20 | 17 | 17 |
15 | 15 | 15 | 17 | 17 |
I can get the above output by writing code for each column, but since I have 20 categorical columns, this does not seem efficient.
from category_encoders import TargetEncoder

encoder = TargetEncoder()
# One fit_transform call per column, repeated for every categorical column
train['cat_col1'] = encoder.fit_transform(train['cat_col1'], train['target'])
train['cat_col2'] = encoder.fit_transform(train['cat_col2'], train['target'])
train['cat_col3'] = encoder.fit_transform(train['cat_col3'], train['target'])
train['cat_col4'] = encoder.fit_transform(train['cat_col4'], train['target'])
In addition, I would like to take the target-encoded values learned from the train dataframe and use them to replace the categories in the test dataframe.
Answers:
Assuming you're using the category_encoders implementation, it should accept several columns just fine, at least in recent versions:
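from category_encoders import TargetEncoder

cat_cols = ['cat_col1', 'cat_col2', 'cat_col3', 'cat_col4']
encoder = TargetEncoder(cols=cat_cols)  # fit once on train, reuse on test
train[cat_cols] = encoder.fit_transform(train[cat_cols], train['target'])
test[cat_cols] = encoder.transform(test[cat_cols])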
Alternatively, you could use a loop:
for column in cat_cols:
    encoder = TargetEncoder()
    train[column] = encoder.fit_transform(train[column], train['target'])
    test[column] = encoder.transform(test[column])
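If it helps to see what the train-to-test mapping amounts to, here is a minimal pandas-only sketch. It is not what category_encoders does internally (that implementation also applies smoothing, so its numbers will generally differ); it uses a plain per-category mean. The helper name mean_target_encode and the test rows are made up for illustration, and falling back to the overall train mean for unseen categories is an assumption on my part.

```python
import pandas as pd


def mean_target_encode(train, test, cat_cols, target):
    """Replace each category with the mean target observed for it in train.

    Hypothetical helper for illustration: plain means, no smoothing.
    """
    train_enc, test_enc = train.copy(), test.copy()
    # Assumed fallback for categories that never appear in train
    global_mean = train[target].mean()
    for col in cat_cols:
        # Series indexed by category, holding the mean target per category
        means = train.groupby(col)[target].mean()
        train_enc[col] = train[col].map(means)
        # Test rows reuse the train means; unseen categories get the fallback
        test_enc[col] = test[col].map(means).fillna(global_mean)
    return train_enc, test_enc


train = pd.DataFrame({
    "target": [10, 20, 15],
    "cat_col1": ["city1", "city2", "city1"],
    "cat_col2": ["james", "adam", "charles"],
    "cat_col3": ["25-55", "30-40", "30-40"],
    "cat_col4": ["abc", "bcc", "bcc"],
})
# Made-up test rows; "city3" does not occur in train
test = pd.DataFrame({
    "cat_col1": ["city2", "city3"],
    "cat_col2": ["adam", "james"],
    "cat_col3": ["30-40", "25-55"],
    "cat_col4": ["bcc", "abc"],
})

cat_cols = ["cat_col1", "cat_col2", "cat_col3", "cat_col4"]
train_enc, test_enc = mean_target_encode(train, test, cat_cols, "target")
print(train_enc)
print(test_enc)
```

With a plain mean, city1 maps to 12.5 (the mean of 10 and 15) and the unseen city3 in test falls back to the overall train mean of 15. Keep in mind that encoding the train rows with means computed from those same rows leaks the target into the features, which is one reason library implementations smooth the estimates.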