StandardScaler - ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

Question:

I have the following code:

X = df_X.as_matrix(header[1:col_num])
scaler = preprocessing.StandardScaler().fit(X)
X_nor = scaler.transform(X) 

And got the following error:

  File "/Users/edamame/Library/python_virenv/lib/python2.7/site-packages/sklearn/utils/validation.py", line 54, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I used:

print(np.isinf(X))
print(np.isnan(X))

which gives me the output below. This doesn't really tell me which element has the issue, as I have millions of rows.

[[False False False ..., False False False]
 [False False False ..., False False False]
 [False False False ..., False False False]
 ..., 
 [False False False ..., False False False]
 [False False False ..., False False False]
 [False False False ..., False False False]]

Is there a way to identify which value in the matrix X actually causes the problem? How do people avoid it in general?

Asked By: Edamame


Answers:

NumPy contains various element-wise logical tests for this sort of thing.

In your particular case, you will want to use isinf and isnan.

In response to your edit:

You can pass the result of np.isinf() or np.isnan() to np.where(), which will return the indices where a condition is true. Here’s a quick example:

import numpy as np

test = np.array([0.1, 0.3, float("Inf"), 0.2])

bad_indices = np.where(np.isinf(test))

print(bad_indices)

You can then use those indices to replace the content of the array:

test[bad_indices] = -1
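
Since your X is a 2-D matrix, np.where() returns a pair of arrays here: the first holds the row indices and the second the column indices of the offending cells. A quick sketch with a made-up matrix (the values are just an assumption for illustration):

import numpy as np

# hypothetical 2-D matrix containing a single NaN
X = np.array([[0.1, 0.3],
              [np.nan, 0.2]])

rows, cols = np.where(np.isnan(X))
print(rows, cols)  # [1] [0] -> the NaN sits at row 1, column 0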

Answered By: Thomite

"How do people avoid it in general?"

Real example:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

data360 = pd.read_csv(r'C:...')

s = StandardScaler()
data360 = s.fit_transform(data360)

# (row, column) indices of every NaN in the scaled array
print(np.where(np.isnan(data360)))

Output:

(array([    130,     161,     889, ..., 1884216, 1884276, 1884550],
       dtype=int64), array([1, 1, 1, ..., 1, 1, 1], dtype=int64))

You may, out of curiosity, check that this is true by looking up one of the rows in question (I checked row 132 in my CSV file, which corresponds to index 130 in the array):
1010, 131, 0.115462015, nan, 0.291065837, 0.083311105, 8, 2, 2

One way to "fix" the issue:

# keep only the rows in which every value is finite
df_new = data360[np.isfinite(data360).all(1)]

This returns the same data (a NumPy array at this point, not a DataFrame, since fit_transform was applied), without the rows that contain NaN.

Checking len() before and after this step reveals that the data set has been reduced (in my case) from 1884600 to 1870298 rows.
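
If you would rather filter before scaling and keep working with a pandas DataFrame, dropping the incomplete rows up front achieves the same result. A minimal sketch, assuming the CSV (same truncated path as above) is the only source of NaN values:

import pandas as pd
from sklearn.preprocessing import StandardScaler

data360 = pd.read_csv(r'C:...')

# drop every row that contains at least one NaN, then scale the rest
data360_clean = data360.dropna()
scaled = StandardScaler().fit_transform(data360_clean)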

Edit: evaluate the data you have and what you will use it for before you simply remove all rows containing NaN.
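
If dropping rows would throw away too much data, filling the missing values is a common alternative, for example with the column mean. A minimal sketch, assuming scikit-learn 0.20+ (where SimpleImputer lives in sklearn.impute; older releases used sklearn.preprocessing.Imputer) and that a mean fill is appropriate for your data:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X = np.array([[0.1, np.nan, 0.3],
              [0.2, 0.4, 0.6],
              [0.5, 0.8, np.nan]])

# replace each NaN with the mean of its column, then scale
X_filled = SimpleImputer(strategy='mean').fit_transform(X)
X_scaled = StandardScaler().fit_transform(X_filled)
print(X_scaled)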

Answered By: 零審議