One-Hot Encoding Question – Concept and Solution to My Problem (Kaggle Dataset)

Question:

I’m working on an exercise on Kaggle, from their module on categorical variables, specifically the one-hot encoding section: https://www.kaggle.com/alexisbcook/categorical-variables
I’ve worked through the entire workbook fine, and there’s one last piece I’m trying to figure out: the optional exercise at the end, which applies the one-hot encoder to predict house sale values. I’ve worked out the code below, but on the line OH_cols_test = pd.DataFrame(OH_encoder.fit_transform(X_test[low_cardinality_cols])), I’m getting an error that the input contains NaN.

So my first question is: when it comes to one-hot encoding, shouldn’t NAs just be treated like any other category within a particular column? And my second question: if I want to remove these NAs, what’s the most efficient way? I tried imputation, but it looks like that only works for numbers. Can someone please let me know where I’m going wrong here? Thanks very much!

from sklearn.preprocessing import OneHotEncoder

# Use as many lines of code as you need!
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_test = pd.DataFrame(OH_encoder.fit_transform(X_test[low_cardinality_cols]))

# One-hot encoding removed index; put it back
OH_cols_test.index = X_test.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_test = X_test.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_test = pd.concat([num_X_test, OH_cols_test], axis=1)
Asked By: Peter


Answers:

So my first question is: when it comes to one-hot encoding, shouldn’t NAs just be treated like any other category within a particular column?

NAs are just the absence of data, so you can loosely think of rows with NAs as being incomplete. You may find yourself dealing with a dataset where NAs occur in half of the rows; that will require some clever feature engineering to compensate. Think about it this way: if one-hot encoding is a simple way to represent binary state (e.g. is_male, salary_is_less_than_100000, etc.), then what does NaN/null mean? You have a bit of a Schrödinger’s cat on your hands there. You’re generally safe to drop NAs so long as it doesn’t mangle your dataset size; the amount of data loss you’re willing to accept is entirely situation-dependent (it’s probably fine for a practice exercise).
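As a rough illustration, here is a minimal sketch of that drop-the-NAs approach, reusing the X_test and low_cardinality_cols names from the question. (Caveat: dropping test rows means you won’t produce predictions for them, which matters in a real competition.)

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Keep only the rows whose low-cardinality categorical columns are complete
X_test_clean = X_test.dropna(subset=low_cardinality_cols)

OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_test = pd.DataFrame(
    OH_encoder.fit_transform(X_test_clean[low_cardinality_cols]),
    index=X_test_clean.index,
)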

And my second question: if I want to remove these NAs, what’s the most efficient way? I tried imputation, but it looks like that only works for numbers.

May I suggest this.
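On the imputation point: scikit-learn’s SimpleImputer is not limited to numbers; with strategy='most_frequent' (or 'constant') it also works on string/categorical columns. A minimal sketch, again assuming the X_test and low_cardinality_cols names from the question:

import pandas as pd
from sklearn.impute import SimpleImputer

# most_frequent fills each column's NaNs with that column's mode,
# and works for object/string columns as well as numeric ones
imputer = SimpleImputer(strategy='most_frequent')
imputed_cats = pd.DataFrame(
    imputer.fit_transform(X_test[low_cardinality_cols]),
    columns=low_cardinality_cols,
    index=X_test.index,
)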

Answered By: Nick Saccente

I deal with this topic on my blog. You can check the link at the bottom of this answer. All my code/logic appears directly below.

# There are various ways to deal with missing data points.
# You can simply drop records if they contain any nulls.
# data.dropna()
# You can fill nulls with zeros
# data.fillna(0)
# You can also fill with mean, median, or do a forward-fill or back-fill.
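# For example (standard pandas calls):
#   data['age'].fillna(data['age'].mean())   # fill with the column mean
#   data.ffill()                             # propagate the previous row's value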
# The problem with all of these options is that if you have a lot of missing values
# for one specific feature, you won't be able to do very reliable predictive analytics.
# A viable alternative is to impute missing values using some machine learning techniques 
# (regression or classification).
import pandas as pd
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()

# Load data (raw string so the Windows backslashes aren't treated as escape sequences)
data = pd.read_csv(r'C:\Users\ryans\seaborn-data\titanic.csv')
print(data)
print(list(data))
print(data.dtypes)

# Now, we will use a simple regression technique to predict the missing values
# .copy() avoids a SettingWithCopyWarning when we fill values later
data_with_null = data[['survived','pclass','sibsp','parch','fare','age']].copy()

data_without_null = data_with_null.dropna()
train_data_x = data_without_null.iloc[:,:5]
train_data_y = data_without_null.iloc[:,5]

linreg.fit(train_data_x,train_data_y)

test_data = data_with_null.iloc[:,:5]

# Check for nulls per column
print(data_with_null.isnull().sum())

# Find any/all missing data points in entire data set
print(data_with_null.isnull().sum().sum())
# WOW 177 NULLS!!

# LET'S IMPUTE MISSING VALUES...
# Predict an age for every row, then fill in only the rows where age is missing
predicted_age = linreg.predict(test_data)
missing = data_with_null['age'].isnull()
data_with_null.loc[missing, 'age'] = predicted_age[missing.values]

# Check for nulls again; the age column should now be complete
print(data_with_null.isnull().sum())

https://github.com/ASH-WICUS/Notebooks/blob/master/Fillna%20with%20Predicted%20Values.ipynb

One final thought, just in case you don’t already know about this. There are two kinds of categorical data:

Ordinal Data: The categories have an inherent order (small, medium, large)
When your data is ordered in some way, USE LABEL (ORDINAL) ENCODING!
Nominal Data: The categories do not have an inherent order (states in the US)
When your data is nominal, and there is no specific order, USE ONE-HOT ENCODING!

https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/
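A quick sketch of the difference in scikit-learn (toy data invented for illustration; in newer scikit-learn versions the OneHotEncoder argument is sparse_output rather than sparse):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Ordinal data: encode with an explicit, meaningful order
sizes = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})
ord_enc = OrdinalEncoder(categories=[['small', 'medium', 'large']])
print(ord_enc.fit_transform(sizes))  # [[0.], [2.], [1.], [0.]]

# Nominal data: one indicator column per category, no implied order
states = pd.DataFrame({'state': ['TX', 'CA', 'TX']})
oh_enc = OneHotEncoder(sparse=False, handle_unknown='ignore')
print(oh_enc.fit_transform(states))  # [[0., 1.], [1., 0.], [0., 1.]]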

Answered By: ASH