Normalise between 0 and 1 ignoring NaN

Question:

For a list of numbers ranging from x to y that may contain NaN, how can I normalise between 0 and 1, ignoring the NaN values (they stay as NaN)?

Typically I would use MinMaxScaler from sklearn.preprocessing, but this cannot handle NaN and recommends imputing the values based on the mean, median, etc. It doesn’t offer the option to ignore all the NaN values.

Asked By: JakeCowton

Answers:

Consider a pd.Series s:

import numpy as np
import pandas as pd

s = pd.Series(np.random.choice([3, 4, 5, 6, np.nan], 100))
s.hist()


Option 1
Min Max Scaling

# Series.min() and Series.max() skip NaN by default, so NaN entries stay NaN
new = s.sub(s.min()).div(s.max() - s.min())
new.hist()
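
For a plain list rather than a Series, the same min-max idea can be sketched with NumPy's NaN-aware reductions. This is a minimal sketch and not part of the original answer: the helper name minmax_ignore_nan is hypothetical, and mapping a constant input to 0 is an assumption made here to avoid dividing by zero.

```python
import numpy as np

def minmax_ignore_nan(values):
    """Scale values to [0, 1]; NaN entries are ignored and stay NaN."""
    arr = np.asarray(values, dtype=float)
    lo = np.nanmin(arr)  # minimum over the non-NaN entries
    hi = np.nanmax(arr)  # maximum over the non-NaN entries
    if hi == lo:
        # Constant input: assumption made here, map non-NaN entries to 0
        return np.where(np.isnan(arr), np.nan, 0.0)
    # NaN propagates through the arithmetic, so NaN inputs remain NaN
    return (arr - lo) / (hi - lo)

scaled = minmax_ignore_nan([3, 4, np.nan, 6, 5])
print(scaled)
```

The smallest non-NaN value maps to 0, the largest to 1, and the NaN position is preserved.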


NOT WHAT OP ASKED FOR
I put these in because I wanted to

Option 2
sigmoid

sigmoid = lambda x: 1 / (1 + np.exp(-x))

new = sigmoid(s.sub(s.mean()))
new.hist()


Option 3
tanh (hyperbolic tangent)

new = np.tanh(s.sub(s.mean())).add(1).div(2)
new.hist()


Answered By: piRSquared

Here’s a different approach, and one that I believe answers the OP correctly. The only difference is that it works on a DataFrame instead of a list, but you can easily put your list in a DataFrame as done below. The other options didn’t work for me because I needed to store the MinMaxScaler in order to reverse the transform after a prediction was made. So instead of passing the entire column to the MinMaxScaler, you can filter out the NaNs for both the target and the input.

Solution Example

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))

d = pd.DataFrame({'A': [0, 1, 2, 3, np.nan, 3, 2]})

null_index = d['A'].isnull()

d.loc[~null_index, ['A']] = scaler.fit_transform(d.loc[~null_index, ['A']])
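
Because the fitted scaler is kept around, the reverse transform mentioned above can be applied to the same non-NaN rows. The following is a rough sketch under that assumption; the names mask, scaled and restored are illustrative and do not come from the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

d = pd.DataFrame({'A': [0, 1, 2, 3, np.nan, 3, 2]})
mask = d['A'].notnull()  # True for rows that hold a real number

# Fit and transform only the non-NaN rows; NaN rows are left untouched
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = d.copy()
scaled.loc[mask, ['A']] = scaler.fit_transform(d.loc[mask, ['A']])

# Later (e.g. after a prediction), map the scaled values back
restored = scaled.copy()
restored.loc[mask, ['A']] = scaler.inverse_transform(scaled.loc[mask, ['A']])

print(scaled['A'].tolist())
print(restored['A'].tolist())
```

The restored column should match the original values, with the NaN row unchanged throughout.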
Answered By: Chris Farr

It seems that sklearn now (June 2020) behaves as you (and I) desire:
np.nan is left untouched.
(The example is mainly copy-pasted from the sklearn docs.)

import sklearn
import numpy as np
from sklearn.preprocessing import MinMaxScaler
print(sklearn.__version__)
# 0.23.1
data = np.array([[-1, 2, 3], [-0.5, 6, 3], [np.nan, 18, 3]])
print(data)
#[[-1.   2.   3. ]
# [-0.5  6.   3. ]
# [ nan 18.   3. ]]
scaler = MinMaxScaler()
data = scaler.fit_transform(data)
print(data)
#[[0.   0.   0.  ]
# [1.   0.25 0.  ]
# [ nan 1.   0.  ]]
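
For the one-dimensional list in the question, the same behaviour can be used by reshaping the list into a single column first. This is a small sketch that assumes a scikit-learn version where MinMaxScaler passes NaN through as shown above (0.23.1 in this answer); the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

values = [3, 4, np.nan, 6, 5]

# MinMaxScaler expects a 2-D array, so turn the list into one column
column = np.asarray(values, dtype=float).reshape(-1, 1)

# NaN is disregarded when fitting and kept as NaN in the output
scaled = MinMaxScaler().fit_transform(column).ravel()
print(scaled)
```
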
Answered By: Markus Kaukonen