How to apply the same cat.codes to 2 different dataframes?

Question:

I have 2 dataframes X_train and X_test. These 2 dataframes have the same columns.

There is 1 column called levels that needs to be changed from str to int. However, the levels column in each dataframe has different unique values:

X_train has: ['Level 0', 'Level 10', 'Level 30'] as unique values.

X_test has: ['Level 20', 'Level 40'] as unique values.

The goal is to 1) combine the unique values from both X_train and X_test, and then 2) apply the same cat.codes to both dataframes so that they are consistent. How would I do that? The codes applied to both dataframes should be as follows, even though one dataframe may not contain values that the other has:

{0: 'Level 0', 1: 'Level 10', 2: 'Level 20', 3: 'Level 30', 4: 'Level 40'}

Right now I only have the code below, but I'm not sure how to make cat.codes use the unique values from both dataframes.

X_train['levels'] = X_train['levels'].astype('category').cat.codes
X_test['levels'] = X_test['levels'].astype('category').cat.codes
Asked By: Katsu

||

Answers:

Use a single CategoricalDtype, built from the unique values of both dataframes, to control the codes:

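import pandas as pd

# Build one sorted list of the labels found in either dataframe (NaN excluded)
lst = sorted(set(X_train['levels'].dropna().unique())
             | set(X_test['levels'].dropna().unique()))
# A shared dtype fixes the label -> code mapping for both dataframes
lvl = pd.CategoricalDtype(lst, ordered=True)

# Each label gets the same code in both dataframes; NaN becomes -1
X_train['codes'] = X_train['levels'].astype(lvl).cat.codes
X_test['codes'] = X_test['levels'].astype(lvl).cat.codes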

Output:

>>> X_train
     levels  codes
0   Level 0      0
1  Level 10      1
2  Level 30      3

>>> X_test
     levels  codes
0  Level 20      2
1  Level 40      4
2       NaN     -1
Answered By: Corralien
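
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. The helper names `natural_key` and `shared_codes` are hypothetical (they are not part of pandas or of the answer above), and the sample data simply mirrors the output shown in the answer. One deliberate difference: a plain `sorted()` orders strings lexicographically, so a label such as 'Level 100' would land before 'Level 20'. The sketch sorts on the embedded number instead, which is an assumption about how the labels should be ordered.

```python
import re

import pandas as pd


def natural_key(label):
    """Hypothetical sort key: order labels by their first embedded number."""
    match = re.search(r"\d+", label)
    # Labels without a number sort last; the label itself breaks ties.
    return (int(match.group()) if match else float("inf"), label)


def shared_codes(train, test, column):
    """Hypothetical helper: encode `column` in both frames with one shared dtype."""
    # Union of the non-null labels seen in either dataframe.
    labels = set(train[column].dropna()) | set(test[column].dropna())
    dtype = pd.CategoricalDtype(sorted(labels, key=natural_key), ordered=True)
    # code -> label lookup, in the same shape as the dict in the question.
    mapping = dict(enumerate(dtype.categories))
    train_codes = train[column].astype(dtype).cat.codes
    test_codes = test[column].astype(dtype).cat.codes
    return train_codes, test_codes, mapping


# Sample data assumed from the question and the answer's output.
X_train = pd.DataFrame({"levels": ["Level 0", "Level 10", "Level 30"]})
X_test = pd.DataFrame({"levels": ["Level 20", "Level 40", None]})

train_codes, test_codes, mapping = shared_codes(X_train, X_test, "levels")
X_train["codes"] = train_codes
X_test["codes"] = test_codes

print(mapping)
print(X_train)
print(X_test)
```

Returning the `mapping` dict alongside the codes makes it easy to decode predictions back to labels later. Any label that appears only at inference time and is not in the shared category list would be encoded as -1, the same as a missing value.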