NaN values created when joining two dataframes

Question

I am trying to one hot encode data using the sci-kit learn library from, kaggle https://www.kaggle.com/datasets/rkiattisak/salaly-prediction-for-beginer

X is a two column dataframe of the age and years of experience columns with the rows containing null values cleaned out with dropna(). My goal is to one hot encode the Gender and Education columns and merge the one hot encoded values with the values in X. Here are X and df_one pre-join:

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
myTransformer = ColumnTransformer(
    transformers=[('one_hot_encoder', OneHotEncoder(), ['Gender', 'Education Level'])],
)
transformed = myTransformer.fit_transform(X)
columns = myTransformer.named_transformers_['one_hot_encoder'].get_feature_names_out(['Gender', 'Education Level']).tolist()
df_ohe = pd.DataFrame(transformed, columns=columns)
display(df_ohe)
print(df_ohe.isnull().sum())
X = X.drop(columns=['Gender', 'Education Level'])
display(X)
print(X.isnull().sum())
X = X.join(df_ohe)
print(X.isnull().sum())
X

Before I ran the transformer, I cleaned my data so that all rows with NaN values were removed using the dropna() function, and using the .isnull().sum() method I was able to confirm that there were no null values left. Debugging also confirmed that there are no null values in df_ohe, the one hot encoded dataframe. Printing the isnull sum confirms that two rows contain NaN values for the one hot encoded columns once X has been joined with df_ohe. After the join the data is as below:

Does anyone know why this might be happening or if there’s a better/safer way to join these dataframes?

Asked By: camdenmcgath

||

Source

Answer 1

I think this is what you were trying to do. I think the problem was the indexes were different between the dataframes you were merging. You can reset the index and then merge them together like I do below.

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
import pandas as pd

X = pd.read_csv('Salary Data.csv')
X = X.dropna()

ohe_vars = ['Gender', 'Education Level']
OHE_X = X[ohe_vars]
myTransformer = ColumnTransformer(
    transformers=[('one_hot_encoder',
                   OneHotEncoder(),
                   ohe_vars)],
)
transformed = myTransformer.fit_transform(OHE_X)
columns = myTransformer 
    .named_transformers_['one_hot_encoder'] 
    .get_feature_names_out(ohe_vars) 
    .tolist()
df_ohe = pd.DataFrame(transformed, columns=columns)
df_ohe.reset_index(drop=True, inplace=True)
X.reset_index(drop=True, inplace=True)
print(df_ohe.shape)
print(X.shape)
new_df = X.join(df_ohe)
print(new_df.shape)

Answered By: A Simple Programmer

NaN values created when joining two dataframes

Question:

Answers: