NaN values created when joining two dataframes
Question:
I am trying to one hot encode data using the sci-kit learn library from, kaggle https://www.kaggle.com/datasets/rkiattisak/salaly-prediction-for-beginer
X is a two column dataframe of the age and years of experience columns with the rows containing null values cleaned out with dropna()
. My goal is to one hot encode the Gender and Education columns and merge the one hot encoded values with the values in X. Here are X and df_one pre-join:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
myTransformer = ColumnTransformer(
transformers=[('one_hot_encoder', OneHotEncoder(), ['Gender', 'Education Level'])],
)
transformed = myTransformer.fit_transform(X)
columns = myTransformer.named_transformers_['one_hot_encoder'].get_feature_names_out(['Gender', 'Education Level']).tolist()
df_ohe = pd.DataFrame(transformed, columns=columns)
display(df_ohe)
print(df_ohe.isnull().sum())
X = X.drop(columns=['Gender', 'Education Level'])
display(X)
print(X.isnull().sum())
X = X.join(df_ohe)
print(X.isnull().sum())
X
Before I ran the transformer, I cleaned my data so that all rows with NaN values were removed using the dropna()
function, and using the .isnull().sum()
method I was able to confirm that there were no null values left. Debugging also confirmed that there are no null values in df_ohe, the one hot encoded dataframe. Printing the isnull sum confirms that two rows contain NaN values for the one hot encoded columns once X has been joined with df_ohe. After the join the data is as below:
Does anyone know why this might be happening or if there’s a better/safer way to join these dataframes?
Answers:
I think this is what you were trying to do. I think the problem was the indexes were different between the dataframes you were merging. You can reset the index and then merge them together like I do below.
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
import pandas as pd
X = pd.read_csv('Salary Data.csv')
X = X.dropna()
ohe_vars = ['Gender', 'Education Level']
OHE_X = X[ohe_vars]
myTransformer = ColumnTransformer(
transformers=[('one_hot_encoder',
OneHotEncoder(),
ohe_vars)],
)
transformed = myTransformer.fit_transform(OHE_X)
columns = myTransformer
.named_transformers_['one_hot_encoder']
.get_feature_names_out(ohe_vars)
.tolist()
df_ohe = pd.DataFrame(transformed, columns=columns)
df_ohe.reset_index(drop=True, inplace=True)
X.reset_index(drop=True, inplace=True)
print(df_ohe.shape)
print(X.shape)
new_df = X.join(df_ohe)
print(new_df.shape)
I am trying to one hot encode data using the sci-kit learn library from, kaggle https://www.kaggle.com/datasets/rkiattisak/salaly-prediction-for-beginer
X is a two column dataframe of the age and years of experience columns with the rows containing null values cleaned out with dropna()
. My goal is to one hot encode the Gender and Education columns and merge the one hot encoded values with the values in X. Here are X and df_one pre-join:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
myTransformer = ColumnTransformer(
transformers=[('one_hot_encoder', OneHotEncoder(), ['Gender', 'Education Level'])],
)
transformed = myTransformer.fit_transform(X)
columns = myTransformer.named_transformers_['one_hot_encoder'].get_feature_names_out(['Gender', 'Education Level']).tolist()
df_ohe = pd.DataFrame(transformed, columns=columns)
display(df_ohe)
print(df_ohe.isnull().sum())
X = X.drop(columns=['Gender', 'Education Level'])
display(X)
print(X.isnull().sum())
X = X.join(df_ohe)
print(X.isnull().sum())
X
Before I ran the transformer, I cleaned my data so that all rows with NaN values were removed using the dropna()
function, and using the .isnull().sum()
method I was able to confirm that there were no null values left. Debugging also confirmed that there are no null values in df_ohe, the one hot encoded dataframe. Printing the isnull sum confirms that two rows contain NaN values for the one hot encoded columns once X has been joined with df_ohe. After the join the data is as below:
Does anyone know why this might be happening or if there’s a better/safer way to join these dataframes?
I think this is what you were trying to do. I think the problem was the indexes were different between the dataframes you were merging. You can reset the index and then merge them together like I do below.
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
import pandas as pd
X = pd.read_csv('Salary Data.csv')
X = X.dropna()
ohe_vars = ['Gender', 'Education Level']
OHE_X = X[ohe_vars]
myTransformer = ColumnTransformer(
transformers=[('one_hot_encoder',
OneHotEncoder(),
ohe_vars)],
)
transformed = myTransformer.fit_transform(OHE_X)
columns = myTransformer
.named_transformers_['one_hot_encoder']
.get_feature_names_out(ohe_vars)
.tolist()
df_ohe = pd.DataFrame(transformed, columns=columns)
df_ohe.reset_index(drop=True, inplace=True)
X.reset_index(drop=True, inplace=True)
print(df_ohe.shape)
print(X.shape)
new_df = X.join(df_ohe)
print(new_df.shape)