Python pandas: how to remove nan and -inf values
Question:
I have the following dataframe
time X Y X_t0 X_tp0 X_t1 X_tp1 X_t2 X_tp2
0 0.002876 0 10 0 NaN NaN NaN NaN NaN
1 0.002986 0 10 0 NaN 0 NaN NaN NaN
2 0.037367 1 10 1 1.000000 0 NaN 0 NaN
3 0.037374 2 10 2 0.500000 1 1.000000 0 NaN
4 0.037389 3 10 3 0.333333 2 0.500000 1 1.000000
5 0.037393 4 10 4 0.250000 3 0.333333 2 0.500000
....
1030308 9.962213 256 268 256 0.000000 256 0.003906 255 0.003922
1030309 10.041799 0 268 0 -inf 256 0.000000 256 0.003906
1030310 10.118960 0 268 0 NaN 0 -inf 256 0.000000
I tried with the following
df.dropna(inplace=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)
X_train = X_train.drop('time', axis=1)
X_train = X_train.drop('X_t1', axis=1)
X_train = X_train.drop('X_t2', axis=1)
X_test = X_test.drop('time', axis=1)
X_test = X_test.drop('X_t1', axis=1)
X_test = X_test.drop('X_t2', axis=1)
X_test.fillna(X_test.mean(), inplace=True)
X_train.fillna(X_train.mean(), inplace=True)
y_train.fillna(y_train.mean(), inplace=True)
However, I still get the error ValueError: Input contains NaN, infinity or a value too large for dtype('float32'). whenever I try to fit a regression model with fit(X_train, y_train).
How can we remove both the NaN and -inf values at the same time?
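Before removing anything, it can help to see which columns actually hold the offending values. The following is a minimal sketch on a made-up miniature of the frame above (the numbers are illustrative, not taken from the real data); it counts NaN and infinite entries per numeric column.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the frame in the question; values are made up.
df = pd.DataFrame({
    "X_tp0": [np.nan, 1.0, 0.5, -np.inf],
    "X_tp1": [np.nan, np.nan, 1.0, 0.0],
    "Y": [10, 10, 268, 268],
})

# Restrict the check to numeric columns; isinf is undefined for strings.
numeric = df.select_dtypes(include=[np.number])

# Count missing and infinite entries per column.
nan_counts = numeric.isna().sum()
inf_counts = np.isinf(numeric).sum()

report = pd.DataFrame({"nan": nan_counts, "inf": inf_counts})
print(report)
```

A column with a non-zero count in either position is one that dropna alone or fillna alone will not fully clean.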
Answers:
Use pd.DataFrame.isin to flag the unwanted values and pd.DataFrame.any to find the rows that contain at least one of them. Finally, use the negated boolean array to slice the dataframe.
df[~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)]
time X Y X_t0 X_tp0 X_t1 X_tp1 X_t2 X_tp2
4 0.037389 3 10 3 0.333333 2.0 0.500000 1.0 1.000000
5 0.037393 4 10 4 0.250000 3.0 0.333333 2.0 0.500000
1030308 9.962213 256 268 256 0.000000 256.0 0.003906 255.0 0.003922
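To make the isin/any approach easier to follow, here is a small sketch on an invented two-column frame (the column names a and b and the values are assumptions for illustration). It builds the row mask in a separate step so it can be inspected before slicing.

```python
import numpy as np
import pandas as pd

# Made-up data: row 1 has a NaN, row 2 has -inf.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [0.5, 0.25, -np.inf, 2.0],
})

# True for every row that contains NaN, inf or -inf in any column.
bad_rows = df.isin([np.nan, np.inf, -np.inf]).any(axis=1)

# Keep only the rows that are not flagged.
clean = df[~bad_rows]
print(clean)
```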
You can replace inf and -inf with NaN, and then select non-null rows.
df[df.replace([np.inf, -np.inf], np.nan).notnull().all(axis=1)] # .astype(np.float64) ?
or
df.replace([np.inf, -np.inf], np.nan).dropna(axis=0) # axis=0 drops rows; axis=1 would drop whole columns
Check the dtypes of your columns via df.info() to make sure they are all as expected (e.g. np.float32/64).
df = df.replace([np.inf, -np.inf], np.nan) # replace returns a new object, so assign it back
df.dropna(inplace=True)
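Applied to the workflow in the question, one option is to clean the whole frame first and only then split it into features and target, so the two stay aligned row for row. The sketch below assumes a hypothetical helper name (drop_non_finite_rows) and made-up column names (feature, target).

```python
import numpy as np
import pandas as pd

def drop_non_finite_rows(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without rows that contain NaN, inf or -inf."""
    # Turn infinities into NaN so that dropna can handle everything at once.
    cleaned = frame.replace([np.inf, -np.inf], np.nan)
    return cleaned.dropna(axis=0, how="any")

# Made-up data standing in for the real frame.
df = pd.DataFrame({
    "feature": [0.0, 1.0, -np.inf, 0.25, np.nan],
    "target": [1.0, 2.0, 3.0, np.inf, 5.0],
})

clean = drop_non_finite_rows(df)

# Derive X and y from the cleaned frame so they share the same rows.
X = clean[["feature"]]
y = clean["target"]
```

The resulting X and y could then be passed to train_test_split as in the question.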
Instead of dropping rows that contain any nulls or infinite numbers, it is more succinct to reverse the logic and return the rows where all cells are finite numbers. The numpy isfinite function does this, and '.all(axis=1)' returns True only if all cells in a row are finite.
df = df[np.isfinite(df).all(axis=1)]
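As a small illustration of the finite-rows idea, the sketch below uses an invented all-float frame (column names x and y are assumptions) and keeps the intermediate mask visible.

```python
import numpy as np
import pandas as pd

# Made-up data: only the last row is finite in every column.
df = pd.DataFrame({
    "x": [1.0, np.inf, 3.0],
    "y": [np.nan, 2.0, 4.0],
})

# Element-wise check: False for NaN, inf and -inf.
finite_mask = np.isfinite(df)

# A row passes only if every cell in it is finite.
row_ok = finite_mask.all(axis=1)

df_finite = df[row_ok]
print(df_finite)
```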
Edit: If you have some non-numerical dtypes in your dataframe, you might want to isolate the float dtype columns of interest. See example below.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(3,4), columns=list("ABCD"))
df['E'] = 'a_string'
df.at[2,'D'] = np.nan
df.at[1,'B'] = np.inf
df['A'] = df['A'].astype(np.float16)
df
A B C D E
0 0.325195 0.199801 0.175851 0.989883 a_string
1 0.040192 inf 0.296379 0.632059 a_string
2 0.348877 0.369374 0.976187 NaN a_string
floating_columns = df.select_dtypes(include=[np.floating]).columns
subset_df = df[floating_columns]
df = df[np.isfinite(subset_df).all(axis=1)]
df
A B C D E
0 0.381104 0.119991 0.388697 0.235735 a_string
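The same idea can be wrapped in a small function so it is reusable on frames with mixed dtypes. This is a sketch with a hypothetical function name (keep_finite_rows) and deterministic made-up data; it selects all numeric columns rather than only floating ones, which is a slight variation on the example above.

```python
import numpy as np
import pandas as pd

def keep_finite_rows(frame: pd.DataFrame) -> pd.DataFrame:
    """Keep rows whose numeric cells are all finite; other columns are ignored."""
    numeric_cols = frame.select_dtypes(include=[np.number]).columns
    mask = np.isfinite(frame[numeric_cols]).all(axis=1)
    return frame[mask]

# Made-up data with float, integer and string columns.
df = pd.DataFrame({
    "A": [0.1, 0.2, np.nan],
    "B": [1.0, np.inf, 3.0],
    "n": [1, 2, 3],
    "E": ["a_string", "a_string", "a_string"],
})

result = keep_finite_rows(df)
print(result)
```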
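df.replace returns a new dataframe rather than modifying the original, so the inf values remain, and thus the error, unless the result is assigned back. For a plain Python list of values (called values here for illustration),
values = list(filter(lambda x: x != np.inf, values))
would remove all occurrences of inf, and then the drop function can be used on the dataframe for the remaining NaN values.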
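It may help to check how replace actually behaves on a small frame. In the sketch below (made-up data, a single hypothetical column v), it returns a new object in which every matching value is replaced, while the original frame is left unchanged unless the result is assigned back.

```python
import numpy as np
import pandas as pd

# Made-up column with three infinite entries.
df = pd.DataFrame({"v": [np.inf, 1.0, np.inf, -np.inf]})

# replace returns a new frame; df itself is not modified here.
replaced = df.replace([np.inf, -np.inf], np.nan)

n_inf_original = int(np.isinf(df["v"]).sum())
n_inf_replaced = int(np.isinf(replaced["v"]).sum())
n_nan_replaced = int(replaced["v"].isna().sum())

print(n_inf_original, n_inf_replaced, n_nan_replaced)
```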
I prefer to set the option so that inf values are treated as NaN:
s1 = pd.Series([0, 1, 2])
s2 = pd.Series([2, 1, 0])
s1/s2
# Outputs:
# 0.0
# 1.0
# inf
# dtype: float64
pd.set_option('mode.use_inf_as_na', True)
s1/s2
# Outputs:
# 0.0
# 1.0
# NaN
# dtype: float64
Note you can also use a context manager:
with pd.option_context('mode.use_inf_as_na', True):
    print(s1/s2)
# Outputs:
# 0.0
# 1.0
# NaN
# dtype: float64
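The mode.use_inf_as_na option is deprecated as of pandas 2.1, so on newer versions it may be safer to convert the infinities explicitly. A minimal sketch using the same two series (the variable name ratio is an assumption):

```python
import numpy as np
import pandas as pd

s1 = pd.Series([0, 1, 2])
s2 = pd.Series([2, 1, 0])

# Division by zero yields inf; convert it to NaN right after the operation.
ratio = (s1 / s2).replace([np.inf, -np.inf], np.nan)

# From here on, the usual NaN tools (isna, dropna, fillna) apply.
print(ratio.dropna())
```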