Replacing Pandas or Numpy NaN with None to use with MySQLdb
Question:
I am trying to write a Pandas dataframe (or a numpy array) to a MySQL database using MySQLdb. MySQLdb doesn’t seem to understand ‘nan’, and my database throws an error saying nan is not in the field list. I need to find a way to convert the ‘nan’ into a NoneType.
Any ideas?
Answers:
You can replace nan with None in your numpy array:
>>> import numpy as np
>>> x = np.array([1, np.nan, 3])
>>> y = np.where(np.isnan(x), None, x)
>>> print(y)
[1.0 None 3.0]
>>> print(type(y[1]))
<class 'NoneType'>
@bogatron has it right, you can use where. It’s worth noting that you can do this natively in pandas:
df1 = df.where(pd.notnull(df), None)
Note: this changes the dtype of all columns to object.
Example:
In [1]: df = pd.DataFrame([1, np.nan])
In [2]: df
Out[2]:
0
0 1
1 NaN
In [3]: df1 = df.where(pd.notnull(df), None)
In [4]: df1
Out[4]:
0
0 1
1 None
Note: what you cannot do is recast the DataFrame’s dtype to allow all data types, using astype, and then use the DataFrame fillna method:
df1 = df.astype(object).replace(np.nan, 'None')
Unfortunately neither this, nor using replace, works with None; see this (closed) issue.
As an aside, it’s worth noting that for most use cases you don’t need to replace NaN with None, see this question about the difference between NaN and None in pandas.
However, in this specific case it seems you do (at least at the time of this answer).
Quite old, yet I stumbled upon the very same issue.
Try doing this:
df['col_replaced'] = df['col_with_npnans'].apply(lambda x: None if np.isnan(x) else x)
After stumbling around, this worked for me:
df = df.astype(object).where(pd.notnull(df),None)
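As a runnable check of the line above, here is a minimal sketch on a small made-up frame (the column name a and its values are assumptions for illustration only). Casting to object first matters because a float64 column cannot hold None.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column name "a" is only for illustration.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0]})

# Cast to object first so the column can hold None, then keep the
# non-null cells and put None everywhere else.
df_none = df.astype(object).where(pd.notnull(df), None)

values = df_none["a"].tolist()
print(values)
```

The resulting column has object dtype, so later numeric operations on it may need a cast back to float.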
df = df.replace({np.nan: None})
Note: For pandas versions <1.4, this changes the dtype of all affected columns to object.
To avoid that, use this syntax instead:
df = df.replace(np.nan, None)
Credit goes to this guy here on this Github issue and Killian Huyghe’s comment.
Just an addition to @Andy Hayden’s answer:
Since DataFrame.mask is the opposite twin of DataFrame.where, they have exactly the same signature but with opposite meaning:
DataFrame.where is useful for replacing values where the condition is False.
DataFrame.mask is used for replacing values where the condition is True.
So in this question, using df.mask(df.isna(), other=None, inplace=True) might be more intuitive.
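A minimal sketch of the mask variant on made-up data follows. The astype(object) step is an assumption added here, borrowed from other answers in this thread, so that the column is able to hold None; the column name a is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0]})

# mask replaces values where the condition is True, so the condition is
# "is missing" rather than "is not missing" as with where.
# astype(object) is an added assumption so the column can store None.
df_none = df.astype(object).mask(df.isna(), None)

masked_values = df_none["a"].tolist()
print(masked_values)
```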
Another addition: be careful when replacing multiple values and converting the type of the column back from object to float. If you want to be certain that your None’s won’t flip back to np.NaN’s, apply @andy-hayden’s suggestion of using DataFrame.where.
Illustration of how replace can still go ‘wrong’:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame({"a": [1, np.nan, np.inf]})
In [4]: df
Out[4]:
a
0 1.0
1 NaN
2 inf
In [5]: df.replace({np.nan: None})
Out[5]:
a
0 1
1 None
2 inf
In [6]: df.replace({np.nan: None, np.inf: None})
Out[6]:
a
0 1.0
1 NaN
2 NaN
In [7]: df.where((pd.notnull(df)), None).replace({np.inf: None})
Out[7]:
a
0 1.0
1 NaN
2 NaN
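One way to sidestep the dtype flip-flopping shown in this illustration is to do the conversion at the Python level while building the parameter rows, rather than inside the DataFrame. The sketch below is only one possible approach: the helper name to_sql_rows and the sample data are hypothetical, and treating inf as NULL is an assumption that mirrors the illustration.

```python
import math

import numpy as np
import pandas as pd


def to_sql_rows(frame):
    """Return a list of tuples with NaN, None and +/-inf replaced by None."""
    def clean(value):
        # pandas does not treat inf as missing, so check it explicitly.
        if isinstance(value, float) and math.isinf(value):
            return None
        # pd.isna covers NaN, None and NaT for scalar values.
        if pd.isna(value):
            return None
        return value

    return [
        tuple(clean(v) for v in row)
        for row in frame.itertuples(index=False, name=None)
    ]


# Hypothetical sample data for illustration.
df = pd.DataFrame({"a": [1.0, np.nan, np.inf], "b": ["x", "y", "z"]})
rows = to_sql_rows(df)
print(rows)
```

The resulting list of tuples is the shape that DB-API methods such as cursor.executemany expect for their parameters, and the DataFrame itself is left untouched.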
I believe the cleanest way would be to make use of the na_value argument in the pandas.DataFrame.to_numpy() method (docs):
na_value : Any, optional
The value to use for missing values. The default value depends on dtype and the dtypes of the DataFrame columns.
New in version 1.1.0.
You could e.g. convert to dictionaries with NaN’s replaced by None using
columns = df.columns.tolist()
dicts_with_nan_replaced = [
    dict(zip(columns, x))
    for x in df.to_numpy(na_value=None)
]
Do you have a code block to review by chance?
Using .loc, pandas can access records based on logical conditions (filtering) and act on them (when using =). Setting a .loc mask equal to some value changes the DataFrame in place (so be a touch careful here; I suggest testing on a copy of the df before using it in your code).
df.loc[df['SomeColumn'].isna(), 'SomeColumn'] = None
The outer function is df.loc[row_label, column_label] = None. We’re going to use a boolean mask for row_label by using the .isna() method to find missing values in our column SomeColumn.
We’ll use the .isna() method to return a boolean array of rows/records in column SomeColumn as our row_label: df['SomeColumn'].isna(). It will isolate all rows where SomeColumn has any of the missing-value markers pandas checks for with the .isna() method (such as None and NaN).
We’ll use the column_label both when masking the dataframe for the row_label, and to identify the column we want to act on for the .loc mask.
Finally, we set the .loc mask equal to None, so the rows/records returned are changed to None based on the masked index.
Below are links to pandas documentation regarding .loc & .isna().
References:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html
This worked for me:
df = df.fillna(0)
Convert numpy NaN to pandas NA before replacing with the where statement:
df = df.replace(np.nan, pd.NA).where(df.notnull(), None)
After finding that neither the recommended answer nor the suggested alternative worked for my application after a Pandas update to 1.3.2, I settled for safety with a brute-force approach:
import json
buf = df.to_json(orient='records')
recs = json.loads(buf)
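For reference, a self-contained version of this round trip on made-up data (the column names a and b are assumptions for illustration) shows that missing values come back as None:

```python
import json

import numpy as np
import pandas as pd

# Hypothetical sample data with a missing value in each column.
df = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", None]})

# to_json writes missing values as JSON null, and json.loads turns
# JSON null into Python None.
buf = df.to_json(orient="records")
recs = json.loads(buf)
print(recs)
```

Keep in mind that this goes through a text representation, so other types, such as datetimes, may not come back in their original form.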
Replacing np.nan with None is accomplished differently across different versions of pandas:
# assumes version comes from the packaging library
from packaging import version
if version.parse(pd.__version__) >= version.parse('1.3.0'):
    df = df.replace({np.nan: None})
else:
    df = df.where(pd.notnull(df), None)
This solves the issue that for pandas versions <1.3.0, if the values in df are already None, then df.replace({np.nan: None}) will toggle them back to np.nan (and vice versa).
Yet another option, that actually did the trick for me:
df = df.astype(object).replace(np.nan, None)
Astoundingly, none of the previous answers worked for me, so I had to do it for each column.
for column in df.columns:
    df[column] = df[column].where(pd.notnull(df[column]), None)
Doing it by hand is the only way that is working for me right now.
This answer from @rodney cox worked for me in almost every case.
The following code sets all columns to object data type and then replaces any null value with None. Setting the column data type to object is crucial because it prevents pandas from changing the type further.
for col in df.columns:
    df[col] = df[col].astype(object)
    df.loc[df[col].isnull(), col] = None
Warning: This solution is not efficient, because it processes columns that might not have np.nan values.
Sometimes it is better to use this code. Note that np refers to numpy:
df = df.fillna(np.nan).replace([np.nan], [None])
This should work:
df["column"]=df["column"].apply(lambda x: None if pd.isnull(x) else x)