Losing my target variable when encoding categorical variables

Question:

I am dealing with a little challenge: I am trying to build a multiclass logistic regression model. Some of my variables are categorical, so I am trying to encode them.

My initial dataset looks like this:

[screenshot of the initial dataset]

The column I want to predict is action1_preflop; it contains 3 possible classes: "r", "c", "f".

When encoding categorical features, I end up losing the variable I want to predict as it gets converted into 3 sub-variables:
action1_preflop_r
action1_preflop_f
action1_preflop_c

Below is the new dataframe after encoding:

       tiers  tiers2_theory  ...  action1_preflop_f  action1_preflop_r
0          7             11  ...                  1                  0
1          1              7  ...                  0                  1
2          5             11  ...                  1                  0
3          1             11  ...                  0                  1
4          1              7  ...                  0                  1
     ...            ...  ...                ...                ...
31007      4             11  ...                  0                  1
31008      1             11  ...                  0                  1
31009      1             11  ...                  0                  1
31010      1             11  ...                  0                  1
31011      2              7  ...                  0                  1

[31012 rows x 11 columns]

Could you please let me know how I am supposed to deal with these new variables, considering that the original variable, before being encoded, was actually the one I wanted to predict?

Thanks for the help

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import linear_model

df_raw = pd.read_csv(r'\Users\rapha\Desktop\Consulting\Poker\Tables test\SB_preflop_a1_prob V1.csv', sep=";")

# Select categorical features only & use binary encoding
feature_cols = ['tiers', 'tiers2_theory', 'tiers3_theory', 'assorties', 'score', 'proba_preflop', 'action1_preflop']
df_raw = df_raw[feature_cols]

cat_features = df_raw.select_dtypes(include=[object])
num_features = df_raw.select_dtypes(exclude=[object])
df = num_features.join(pd.get_dummies(cat_features))
df = df.select_dtypes(exclude=[object])

# This is where I lose the target: 'action1_preflop' no longer exists after get_dummies
df_outcome = df.action1_preflop
df_variables = df.drop('action1_preflop', axis=1)

x = df_variables
y = df.action1_preflop

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

lm = linear_model.LogisticRegression(multi_class='ovr', solver='liblinear')
lm.fit(x_train, y_train)

predict_test = lm.predict(x_test)
print(lm.score(x_test, y_test))
Asked By: Raphaël Ambit


Answers:

You should leave action1_preflop out of the cat_features dataframe and include it in the num_features dataframe:

# Keep the target out of the columns that get dummy-encoded
cat_features = df_raw.select_dtypes(include=[object])
cat_features = cat_features.drop(['action1_preflop'], axis=1)

# Carry the raw target along with the numeric features instead
num_features = df_raw.select_dtypes(exclude=[object])
num_features = pd.concat([num_features, df_raw['action1_preflop']], axis=1)
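Putting it together, a minimal end-to-end sketch could look like the following. It continues from the num_features / cat_features above and reuses the imports and column names from the question; it has not been run against the original CSV.

from sklearn.model_selection import train_test_split
from sklearn import linear_model

# Dummy-encode only the feature columns; the raw target travels along in num_features
df = pd.concat([num_features, pd.get_dummies(cat_features)], axis=1)

y = df['action1_preflop']              # labels stay as "r" / "c" / "f"
X = df.drop('action1_preflop', axis=1)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

lm = linear_model.LogisticRegression(multi_class='ovr', solver='liblinear')
lm.fit(x_train, y_train)
print(lm.score(x_test, y_test))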
Answered By: gtomer

You can also save some typing, and skip the manual join:

cat_features = df_raw.select_dtypes(include=[object]).columns.to_list()
cat_features.remove("action1_preflop")

Then you can just pass this list to the columns parameter of pd.get_dummies:

df = pd.get_dummies(df_raw, columns=cat_features)
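For example (a sketch assuming the same df_raw as in the question), the target column survives the encoding untouched and can be split off directly:

cat_features = df_raw.select_dtypes(include=[object]).columns.to_list()
cat_features.remove("action1_preflop")

# Only the remaining categorical columns are dummy-encoded
df = pd.get_dummies(df_raw, columns=cat_features)

y = df['action1_preflop']                # still "r" / "c" / "f"
X = df.drop('action1_preflop', axis=1)   # numeric + dummy-encoded features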
Answered By: Nohman