How to set a cell to NaN in a pandas dataframe
Question:
I’d like to replace bad values in a column of a dataframe by NaN’s.
import numpy as np
import pandas as pd
mydata = {'x' : [10, 50, 18, 32, 47, 20], 'y' : ['12', '11', 'N/A', '13', '15', 'N/A']}
df = pd.DataFrame(mydata)
df[df.y == 'N/A']['y'] = np.nan
However, the last line fails and raises a warning because it is working on a copy of df. So what is the correct way to handle this? I've seen many solutions with iloc or ix, but here I need to use a boolean condition.
Answers:
Just use replace:
In [106]:
df.replace('N/A', np.nan)
Out[106]:
x y
0 10 12
1 50 11
2 18 NaN
3 32 13
4 47 15
5 20 NaN
What you're trying is called chained indexing: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
You can use loc to ensure you operate on the original df:
In [108]:
df.loc[df['y'] == 'N/A','y'] = np.nan
df
Out[108]:
x y
0 10 12
1 50 11
2 18 NaN
3 32 13
4 47 15
5 20 NaN
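To make the loc approach above easy to try outside an IPython session, here is a minimal self-contained sketch. It reuses the question's data; the names mask and n_missing are only illustrative and are not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [10, 50, 18, 32, 47, 20],
                   "y": ["12", "11", "N/A", "13", "15", "N/A"]})

# Build the boolean mask once, then select rows and column in a single
# .loc call so the assignment targets df itself, not an intermediate object.
mask = df["y"] == "N/A"
df.loc[mask, "y"] = np.nan

# Count how many cells in y are now missing.
n_missing = int(df["y"].isna().sum())
print(n_missing)
```

Keeping the mask in its own variable also makes it easy to reuse the same condition for other columns.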
You can use replace:
df['y'] = df['y'].replace({'N/A': np.nan})
Also be aware of the inplace parameter for replace. You can do something like:
df.replace({'N/A': np.nan}, inplace=True)
This replaces every instance in df and modifies df itself rather than returning a new DataFrame.
Similarly, if you run into other kinds of placeholder values, such as an empty string or None:
df['y'] = df['y'].replace({'': np.nan})
df['y'] = df['y'].replace({None: np.nan})
Reference: Pandas Latest – Replace
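If a column contains several different placeholders, they can probably be handled in one call by passing a list to replace. The sketch below is only an illustration: the sample Series and the placeholder strings ("", "N/A", "missing") are assumptions, not values from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical column with more than one kind of placeholder.
raw = pd.Series(["12", "", "N/A", "13", "missing"])

# Every value in the list is replaced by NaN; replace returns a new Series.
placeholders = ["N/A", "", "missing"]
cleaned = raw.replace(placeholders, np.nan)

n_missing = int(cleaned.isna().sum())
print(cleaned)
```

Because replace returns a new object here, raw keeps its original values until you assign the result back.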
While replace seems to solve the problem, I would like to propose an alternative. With a column that mixes numeric values and a few strings, the real goal is not just to replace the strings with np.nan, but to give the whole column a proper type. I would bet that the original column is of object type:
Name: y, dtype: object
What you really need is to make it a numeric column (it will have a proper type and operations on it will be considerably faster), with all non-numeric values replaced by NaN.
A good conversion would therefore be:
pd.to_numeric(df['y'], errors='coerce')
Specify errors='coerce' to force strings that can't be parsed as a numeric value to become NaN. The column type would then be:
Name: y, dtype: float64
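As a small end-to-end sketch of the conversion described above, using the question's data: the result is assigned back to the column (an extra step the answer does not show), and total is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 50, 18, 32, 47, 20],
                   "y": ["12", "11", "N/A", "13", "15", "N/A"]})

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
df["y"] = pd.to_numeric(df["y"], errors="coerce")

# The column is now numeric, so arithmetic works and skips NaN by default.
total = df["y"].sum()
print(df["y"].dtype, total)
```

Note that this converts every unparseable value, not only 'N/A', so it is worth checking which values became NaN if the column may contain other unexpected strings.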
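You can try these snippets.
In [16]: mydata = {'x' : [10, 50, 18, 32, 47, 20], 'y' : ['12', '11', 'N/A', '13', '15', 'N/A']}
In [17]: df = pd.DataFrame(mydata)
In [18]: df.y[df.y == "N/A"] = np.nan  # chained assignment: may warn, and leaves df unchanged under Copy-on-Write; df.loc[df.y == "N/A", "y"] = np.nan is safer
In [19]: df
Out[19]: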
x y
0 10 12
1 50 11
2 18 NaN
3 32 13
4 47 15
5 20 NaN
df.loc[df.y == 'N/A',['y']] = np.nan
This solves your problem. With the double [] (df[...][...]), you are working on a copy of the DataFrame. You have to specify the exact location in a single call to be able to modify it.
As of pandas 1.0.0, you no longer need numpy to create null values in your DataFrame. Instead you can use pandas.NA (which is of type pandas._libs.missing.NAType). pandas treats it as missing inside a DataFrame (for example, isna() returns True for it), but outside pandas it is not the same object as None or np.nan.
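A quick way to see how pandas.NA behaves is the sketch below. It uses mask rather than replace purely for illustration, and the short sample Series is an assumption, not data from the question.

```python
import pandas as pd

s = pd.Series(["12", "N/A", "13"], dtype="object")

# Put pd.NA wherever the placeholder string appears.
s = s.mask(s == "N/A", pd.NA)

# pandas recognises pd.NA as missing...
missing_flags = s.isna().tolist()
na_detected = bool(pd.isna(pd.NA))

# ...but pd.NA is its own singleton, not Python's None.
na_is_none = pd.NA is None

print(missing_flags, na_detected, na_is_none)
```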
To replace values directly in the DataFrame, use the inplace argument.
df.replace('columnvalue', np.nan, inplace=True)
Most of the answers above need to import an external module:
import numpy as np
There is a solution built into pandas itself, pd.NA, which you can use like this:
df.replace('N/A', pd.NA)
You can use the fillna method that pandas provides:
df.fillna(0,inplace=True)
The first parameter is the value you want to replace the NA with.
By default, the pandas fillna method returns a new DataFrame, because the inplace parameter defaults to inplace=False.
If you set inplace=True, the method will return nothing, and will instead directly modify the DataFrame that's being operated on.
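The difference between the two modes of fillna can be checked with a short sketch like this one. The small DataFrame and the variable names (filled, missing_before, missing_after) are assumptions made for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"y": [12.0, np.nan, 13.0]})

# Default behaviour: a new DataFrame comes back and df is left untouched.
filled = df.fillna(0)
missing_before = int(df["y"].isna().sum())

# With inplace=True, df itself is modified.
df.fillna(0, inplace=True)
missing_after = int(df["y"].isna().sum())

print(missing_before, missing_after)
```

Keep in mind that fillna goes in the opposite direction from the question: it fills existing missing values rather than creating them.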