Is it possible to learn from and predict NaN-values with machine learning?

Question:

I’m trying to solve a regression problem with two output values. The output values act as two different thresholds for incoming booking values, to accept or reject the bookings.

The two output values are set manually in the business case, but this shall be done automatically with the help of machine learning. One of the output values can be NaN in the business case; then all bookings are accepted for that criterion. So an output value that isn’t filled is a valid setting in the business case.

example:

import numpy as np
from sklearn.neural_network import MLPRegressor

X_train = np.array([(1,1),(2,2),(3,3),(4,4)])
Y_train = np.array([(1,1),(2,2),(3,3),(4,np.nan)])
X_test = np.array([(5,5),(6,6),(7,7)])
Y_test = np.array([(5,5),(6,np.nan),(7,7)])

reg = MLPRegressor()
reg = reg.fit(X_train, Y_train)

My problem is that scikit-learn, for example, throws an error when I set NaN values for the output Y_train/Y_test.

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I don’t want to impute these values with the mean or 0 because, as mentioned above, missing values are a valid setting of the business case.

Is it possible to solve such a problem with scikit-learn or with machine learning in general?

EDIT: The output values that are not set by the business are not stored directly as NaN but as -9999999999, representing infinity. I replaced these values with NaN because I thought such extreme values would distort the results. Without any replacement, the variables would actually look like this:

X_train = np.array([(1,1),(2,2),(3,3),(4,4)])
Y_train = np.array([(1,1),(2,2),(3,3),(4,-9999999999)])
X_test = np.array([(5,5),(6,6),(7,7)])
Y_test = np.array([(5,5),(6,-9999999999),(7,7)])

Is it better to keep those values rather than NaN, or do they distort the results and have to be omitted?

Asked By: Taskmanager


Answers:

The whole point of training data is to supervise the model, teaching it to predict an output from a set of features. Keeping NaN values as part of the training X, y therefore doesn’t make sense: a model is not going to ‘fill in the gaps’ and still learn.

The standard way is to use missing-value techniques such as imputing by the mean or 0, using KNN to replace the missing value based on the nearest neighbors of the sample containing it, imputation techniques for sequential data (slinear, akima, quadratic, spline, etc.), or encoding methods that can handle missing data.
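
For illustration, here is a minimal sketch of the first two options using scikit-learn’s SimpleImputer and KNNImputer (the toy matrix is made up for the example):

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy targets with one missing entry (made up for illustration)
Y = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0],
              [4.0, np.nan]])

# Option 1: replace each NaN with its column mean
Y_mean = SimpleImputer(strategy='mean').fit_transform(Y)

# Option 2: replace each NaN using the nearest neighbors of that row
Y_knn = KNNImputer(n_neighbors=2).fit_transform(Y)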

If you don’t want to use a missing value handling strategy, then you should NOT keep the row as part of the training dataset.

Is it possible to solve such a problem with scikit-learn or with machine learning in general?

Yes. As I mentioned, there is a whole domain of research dedicated to solving this problem (KNN is the most popular and accessible machine learning approach to handling it). Articles on missing-data imputation can guide you further.


EDIT (based on OP’s edits)

Replacing the -9999999999 values with NaN was the right approach, since we don’t know why the business set them to that value. It is most likely missing data that was imputed with a garbage value so it could be stored in a database without too many issues. Secondly, it is wiser to treat them as NaN values rather than outliers. Therefore, I would recommend removing the rows that contain those values for the purposes of supervised training.

Another thing I noticed is that those values appear only in Y_train and Y_test. This makes things easier: if the NaNs are only in the Y data, you can simply keep those rows as data for prediction. Train the model on the rows without NaNs, then use that model to predict the Y values of the remaining rows and replace the NaNs.
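
A minimal sketch of that idea, reusing the toy arrays from the question (max_iter is raised only to help convergence on such tiny data):

import numpy as np
from sklearn.neural_network import MLPRegressor

X_train = np.array([(1,1),(2,2),(3,3),(4,4)], dtype=float)
Y_train = np.array([(1,1),(2,2),(3,3),(4,np.nan)], dtype=float)

# Rows whose targets are fully observed
complete = ~np.isnan(Y_train).any(axis=1)

# Train only on the complete rows
reg = MLPRegressor(max_iter=2000)
reg.fit(X_train[complete], Y_train[complete])

# Predict for all rows, then fill in only the NaN positions
nan_mask = np.isnan(Y_train)
Y_filled = Y_train.copy()
Y_filled[nan_mask] = reg.predict(X_train)[nan_mask]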

If, however, you think these are extreme values that should be considered outliers, you will still have to remove them from model training, since they would bias the model results like crazy.

Lastly, if this were a classification exercise (not regression), then you could actually treat -9999999999 as a separate class and predict it like any other class. This would not work with regression, since in regression -9999999999 would be part of the continuous scale over which the predictions are made.

Answered By: Akshay Sehgal

Even if your model could generate NaNs as its output, there would be no way to tell whether a NaN is an error or an actual estimation. I wouldn’t use NaNs in a training set.

Not only does NaN not behave like an ordinary numeric value, it is also not possible to perform meaningful arithmetic with NaNs: you can’t compute gradients through them, and you can’t fit a line or slope that intersects them. Simply put, your model wouldn’t be able to learn it as a numerical value, since it’s not a number.
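
To see why, a quick NumPy check shows how NaN propagates through arithmetic and breaks the comparisons a learning algorithm relies on:

import numpy as np

# Any arithmetic involving NaN yields NaN
print(np.nan + 1)        # nan
print(np.nan * 0)        # nan

# NaN is not even equal to itself
print(np.nan == np.nan)  # False

# A squared-error loss over targets containing NaN is itself NaN
y_true = np.array([1.0, 2.0, np.nan])
y_pred = np.array([1.0, 2.0, 3.0])
print(np.mean((y_true - y_pred) ** 2))  # nan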

Answered By: Bedir Yilmaz

Maybe breaking your problem into two tasks would be an acceptable solution: one regression task, and one classification task that predicts whether a value is provided or not.

import numpy as np

X_train = np.array([(1,1),(2,2),(3,3),(4,4)])
Y1_train = np.array([(1,1),(2,2),(3,3),(4,4)])  # regression targets (missing entry given a stand-in value)
Y2_train = np.array([(1,1),(1,1),(1,1),(1,0)])  # indicator: 1 = value provided, 0 = missing

X_test = np.array([(5,5),(6,6),(7,7)])
Y1_test = np.array([(5,5),(6,6),(7,7)])
Y2_test = np.array([(1,1),(1,0),(1,1)])

For regression, do the same as you did before, but with the pair X_train and Y1_train.

Sample code for the classification part:

from sklearn.neural_network import MLPClassifier
clf = MLPClassifier()
clf.fit(X_train, Y2_train)
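
At prediction time the two models could then be combined: the regressor proposes the threshold values and the classifier decides which of them count as ‘not set’. A sketch of this, assuming a regressor reg trained on X_train and Y1_train in the same way:

import numpy as np
from sklearn.neural_network import MLPRegressor

reg = MLPRegressor(max_iter=2000)
reg.fit(X_train, Y1_train)

y1_pred = reg.predict(X_test)   # predicted threshold values, shape (n_samples, 2)
y2_pred = clf.predict(X_test)   # predicted availability, 1 = set, 0 = not set

# Re-introduce NaN wherever the classifier says the value is not set
y_combined = np.where(y2_pred == 1, y1_pred, np.nan)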
Answered By: Sajad.sni