NaN values created when joining two dataframes

Question:

I am trying to one-hot encode data from Kaggle (https://www.kaggle.com/datasets/rkiattisak/salaly-prediction-for-beginer) using the scikit-learn library.

[Image: Kaggle data]

X is a two-column dataframe of the age and years-of-experience columns, with the rows containing null values removed using dropna(). My goal is to one-hot encode the Gender and Education Level columns and merge the encoded values with the values in X. Here are X and df_ohe pre-join:

[Image: X and df_ohe pre-join]

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
myTransformer = ColumnTransformer(
    transformers=[('one_hot_encoder', OneHotEncoder(), ['Gender', 'Education Level'])],
)
transformed = myTransformer.fit_transform(X)
columns = myTransformer.named_transformers_['one_hot_encoder'].get_feature_names_out(['Gender', 'Education Level']).tolist()
df_ohe = pd.DataFrame(transformed, columns=columns)
display(df_ohe)
print(df_ohe.isnull().sum())
X = X.drop(columns=['Gender', 'Education Level'])
display(X)
print(X.isnull().sum())
X = X.join(df_ohe)
print(X.isnull().sum())
X

Before running the transformer, I removed all rows with NaN values using dropna(), and .isnull().sum() confirmed that no null values were left. Debugging also confirmed that there are no null values in df_ohe, the one-hot encoded dataframe. However, once X has been joined with df_ohe, printing the isnull sum shows that two rows contain NaN values in the one-hot encoded columns. After the join, the data looks like this:
[Image: post-join]

Does anyone know why this might be happening or if there’s a better/safer way to join these dataframes?

Asked By: camdenmcgath


Answers:

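I think this is what you were trying to do. The problem is most likely that the two dataframes you were joining had different indexes: dropna() removes rows from X but keeps the original index labels, while df_ohe is built from a plain array and gets a fresh 0..n-1 index. join() aligns rows on the index, so any row of X whose label has no match in df_ohe comes back with NaN in the encoded columns. You can reset the index on both dataframes and then join them, as I do below.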

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
import pandas as pd

X = pd.read_csv('Salary Data.csv')
X = X.dropna()

ohe_vars = ['Gender', 'Education Level']
OHE_X = X[ohe_vars]
myTransformer = ColumnTransformer(
    transformers=[('one_hot_encoder',
                   OneHotEncoder(),
                   ohe_vars)],
)
transformed = myTransformer.fit_transform(OHE_X)
columns = (
    myTransformer
    .named_transformers_['one_hot_encoder']
    .get_feature_names_out(ohe_vars)
    .tolist()
)
df_ohe = pd.DataFrame(transformed, columns=columns)
df_ohe.reset_index(drop=True, inplace=True)
X.reset_index(drop=True, inplace=True)
print(df_ohe.shape)
print(X.shape)
new_df = X.join(df_ohe)
print(new_df.shape)
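
As a rough illustration of the index mismatch, here is a minimal sketch on made-up data. The toy values and the `encoded` array are assumptions standing in for the real salary data and the transformer output; the sketch also shows one possible alternative to resetting both indexes, which is to build the encoded dataframe with `index=X.index` so the original row labels are preserved.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the salary data (values are made up).
X = pd.DataFrame(
    {
        "Age": [32.0, None, 45.0, 28.0],
        "Gender": ["Male", "Female", None, "Female"],
    }
).dropna()  # rows 1 and 2 are dropped; labels 0 and 3 remain

# Stand-in for the encoder output: a plain array, one row per kept row of X.
encoded = np.array([[0.0, 1.0], [1.0, 0.0]])
columns = ["Gender_Female", "Gender_Male"]

# A fresh DataFrame is labelled 0..n-1, so only label 0 lines up with X
# and the row labelled 3 gets NaN in the encoded columns.
misaligned = X.join(pd.DataFrame(encoded, columns=columns))

# Reusing X's index keeps every encoded row attached to its source row.
aligned = X.join(pd.DataFrame(encoded, columns=columns, index=X.index))

print(misaligned.isnull().sum())
print(aligned.isnull().sum())
```

Passing `index=X.index` leaves the original row labels intact, which may be preferable if you need to trace rows back to the source file; resetting both indexes, as in the answer above, works just as well when the labels do not matter.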


Answered By: A Simple Programmer