After dropping columns with missing values, sklearn still throwing ValueError

Question:

I am currently taking the Intermediate Machine Learning course on Kaggle, and am quite new to machine learning.

I’m trying to create a Random Forest model and apply one-hot (OH) encoding to my data, but as it is my first time, I have been struggling a bit.

To keep things simple, I dropped every column that has missing values:

import pandas as pd
from sklearn.model_selection import train_test_split

X = pd.read_csv('/kaggle/input/home-data-for-ml-course/train.csv', index_col='Id') 
X_test = pd.read_csv('/kaggle/input/home-data-for-ml-course/test.csv', index_col='Id')

X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)

cols_with_missing = [col for col in X.columns if X[col].isnull().any()] 
X.drop(cols_with_missing, axis=1, inplace=True)
X_test.drop(cols_with_missing, axis=1, inplace=True)

X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                      train_size=0.8, test_size=0.2,
                                                      random_state=0)

I then encoded my data with OH Encoding:

from sklearn.preprocessing import OneHotEncoder

object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]

num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
num_X_test = X_test.drop(object_cols, axis=1)

# Note: on scikit-learn >= 1.2 the `sparse` argument is named `sparse_output`
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)

OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))
OH_cols_test = pd.DataFrame(OH_encoder.transform(X_test[low_cardinality_cols]))

OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
OH_cols_test.index = X_test.index

OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
OH_X_test = pd.concat([num_X_test, OH_cols_test], axis=1)

I’m not using my validation data for now. I then create the model and predict values:

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(OH_X_train, y_train)
preds = model.predict(OH_X_test)

I get the following error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

I’m a little confused as to why I’m getting this error, considering that I’ve followed the same method for both the training data and the test data. The model fits fine; the error only appears when I call predict. Any help would be greatly appreciated!

Asked By: tyl3366


Answers:

I think I found your issue.

See the code below:

cols_with_missing = [col for col in X.columns if X[col].isnull().any()] 
X.drop(cols_with_missing, axis=1, inplace=True)
X_test.drop(cols_with_missing, axis=1, inplace=True)

Here you identified the columns with missing values in the dataframe X only, and then dropped those columns from both X and X_test. That is where your problem begins: X_test also has missing data, in several columns that are complete in X and were therefore never dropped.

I checked with:

[col for col in X_test.columns if X_test[col].isna().any()]

Which resulted in:

['MSZoning',
 'Utilities',
 'Exterior1st',
 'Exterior2nd',
 'BsmtFinSF1',
 'BsmtFinSF2',
 'BsmtUnfSF',
 'TotalBsmtSF',
 'BsmtFullBath',
 'BsmtHalfBath',
 'KitchenQual',
 'Functional',
 'GarageCars',
 'GarageArea',
 'SaleType']

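So you might consider filling these with a value computed from the corresponding column of X (the mean for numeric columns, the most frequent value for categorical ones such as MSZoning or SaleType), or deleting these rows as well.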
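As a rough illustration of the fill-in option, here is a minimal sketch on toy data. The two small frames and the `fill_values` dictionary are made up for the example and only borrow column names from the real dataset; the point is that the fill values are computed from the training data and then applied to the test data.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X (training) and X_test; the values are invented.
train = pd.DataFrame({
    "GarageArea": [400.0, 500.0, 600.0],
    "SaleType": ["WD", "WD", "New"],
})
test = pd.DataFrame({
    "GarageArea": [450.0, np.nan],
    "SaleType": [np.nan, "New"],
})

# Compute one fill value per column from the TRAINING data only:
# the mean for numeric columns, the most frequent value otherwise.
fill_values = {}
for col in train.columns:
    if pd.api.types.is_numeric_dtype(train[col]):
        fill_values[col] = train[col].mean()
    else:
        fill_values[col] = train[col].mode()[0]

# Apply those training-derived values to the gaps in the test data.
test_filled = test.fillna(fill_values)
print(test_filled)
```

scikit-learn's SimpleImputer (with strategy="mean" or strategy="most_frequent") does the same job and fits more naturally into a pipeline; the pandas version above is only meant to show the idea.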

Answered By: Orkun Aran

With

cols_with_missing = [col for col in X.columns if X[col].isnull().any()]

you are checking for columns with missing values only in the training data. Some columns have no null values in the training set but do have null values in the test set, so those nulls are carried through the rest of your code.
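If you want to stay with the drop-columns approach, one possible fix is to drop every column that has a missing value in either frame. The sketch below uses two tiny made-up frames in place of the real X and X_test, and the names `X_reduced` and `X_test_reduced` are only for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X and X_test; the values are invented.
X = pd.DataFrame({
    "LotArea": [8450, 9600],
    "LotFrontage": [65.0, np.nan],   # missing in training data
    "GarageCars": [2.0, 2.0],        # complete in training data
})
X_test = pd.DataFrame({
    "LotArea": [11622, 14267],
    "LotFrontage": [80.0, 81.0],
    "GarageCars": [1.0, np.nan],     # missing only in test data
})

# A column is dropped if it has a null in X OR in X_test.
cols_with_missing = [
    col for col in X.columns
    if X[col].isnull().any() or X_test[col].isnull().any()
]

X_reduced = X.drop(cols_with_missing, axis=1)
X_test_reduced = X_test.drop(cols_with_missing, axis=1)
print(cols_with_missing)
```

Keep in mind that this discards more features than imputing would, so it trades some model quality for simplicity.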

Answered By: RAY