Modifying a subset of rows in a pandas dataframe
Question:
Assume I have a pandas DataFrame with two columns, A and B. I’d like to modify this DataFrame (or create a copy) so that B is always NaN whenever A is 0. How would I achieve that?
I tried the following
df['A'==0]['B'] = np.nan
and
df['A'==0]['B'].values.fill(np.nan)
without success.
Answers:
Use .loc for label-based indexing:
df.loc[df.A==0, 'B'] = np.nan
The df.A==0 expression creates a boolean series that indexes the rows; 'B' selects the column. You can also use this to transform a subset of a column, e.g.:
df.loc[df.A==0, 'B'] = df.loc[df.A==0, 'B'] / 2
I don’t know enough about pandas internals to know exactly why that works, but the basic issue is that sometimes indexing into a DataFrame returns a copy of the result, and sometimes it returns a view on the original object. According to documentation here, this behavior depends on the underlying numpy behavior. I’ve found that accessing everything in one operation (rather than [one][two]) is more likely to work for setting.
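A minimal sketch of the difference, assuming a small toy frame: the chained form indexes in two steps and may operate on a copy, while the single .loc call assigns into the original frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 0], "B": [2, 0, 5]})

# Chained indexing: df[df.A == 0] may return a copy, so assigning into
# its 'B' column can silently fail (pandas warns with
# SettingWithCopyWarning):
# df[df.A == 0]['B'] = np.nan  # unreliable

# Single .loc operation: pandas knows this is an assignment into the
# original frame, so it always takes effect.
df.loc[df.A == 0, 'B'] = np.nan
print(df['B'].tolist())  # [nan, 0.0, nan]
```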
Here is an excerpt from the pandas docs on advanced indexing:
That section explains exactly what you need! It turns out df.loc (as .ix has been deprecated, as many have pointed out below) can be used for cool slicing/dicing of a dataframe, and it can also be used to set things:
df.loc[selection criteria, columns I want] = value
So Bren’s answer is saying ‘find me all the places where df.A == 0, select column B and set it to np.nan’.
Starting from pandas 0.20, .ix is deprecated. The right way is to use df.loc. Here is a working example:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
A B
0 0 NaN
1 1 0
2 0 NaN
>>>
Explanation:
As explained in the docs here, .loc is primarily label based, but may also be used with a boolean array. So, what we are doing above is applying df.loc[row_index, column_index] by:
- Exploiting the fact that loc can take a boolean array as a mask that tells pandas which subset of rows we want to change in row_index
- Exploiting the fact that loc is also label based, selecting the column using the label 'B' in column_index
We can use any logical condition or operation that returns a boolean series to construct the array of booleans. In the above example, we want every row where A contains a 0; for that we can use df.A == 0, which, as you can see in the example below, returns a series of booleans.
>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df
A B
0 0 2
1 1 0
2 0 5
>>> df.A == 0
0 True
1 False
2 True
Name: A, dtype: bool
>>>
Then, we use the above array of booleans to select and modify the necessary rows:
>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
A B
0 0 NaN
1 1 0
2 0 NaN
For more information check the advanced indexing documentation here.
To replace multiple columns, convert to a NumPy array using .values:
df.loc[df.A==0, ['B', 'C']] = df.loc[df.A==0, ['B', 'C']].values / 2
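A small sketch of the multi-column case, assuming a hypothetical third column C: .values strips the column labels, so the assignment aligns positionally instead of by label.

```python
import pandas as pd

# Toy frame with an extra C column for illustration.
df = pd.DataFrame({"A": [0, 1, 0],
                   "B": [2.0, 0.0, 5.0],
                   "C": [4.0, 6.0, 8.0]})

# Halve B and C only in the rows where A == 0.
df.loc[df.A == 0, ['B', 'C']] = df.loc[df.A == 0, ['B', 'C']].values / 2
print(df['B'].tolist(), df['C'].tolist())  # [1.0, 0.0, 2.5] [2.0, 6.0, 4.0]
```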
For a massive speed increase, use NumPy’s where function.
Setup
Create a two-column DataFrame with 100,000 rows with some zeros.
df = pd.DataFrame(np.random.randint(0,3, (100000,2)), columns=list('ab'))
Fast solution with numpy.where
df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)
Timings
%timeit df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)
685 µs ± 6.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.loc[df['a'] == 0, 'b'] = np.nan
3.11 ms ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
NumPy’s where is about 4x faster.
Alternatives:
Option 1 looks best to me, but oddly I can’t find the supporting documentation for it.
- Filter the column as a series (note: the filter comes after the column being written to, not before)
dataframe.column[filter condition] = values to change to
df.B[df.A==0] = np.nan
- .loc with row and column selection
dataframe.loc[filter condition, column to change] = values to change to
df.loc[df.A == 0, 'B'] = np.nan
- numpy where
dataframe.column = np.where(filter condition, values if true, values if false)
import numpy as np
df.B = np.where(df.A == 0, np.nan, df.B)
- apply with a lambda (axis=1 applies row-wise, not column-wise)
dataframe.column = df.apply(lambda row: value if condition true else value if false, axis=1)
df.B = df.apply(lambda x: np.nan if x['A']==0 else x['B'], axis=1)
- zip and list comprehension
dataframe.column = [value if condition is true else value if false for a, b in zip of columns a and b]
df.B = [np.nan if a==0 else b for a, b in zip(df.A, df.B)]
To modify a DataFrame in pandas you can use "syntactic sugar" operators like +=, *=, /=, etc. So instead of:
df.loc[df.A == 0, 'B'] = df.loc[df.A == 0, 'B'] / 2
You can write:
df.loc[df.A == 0, 'B'] /= 2
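A minimal runnable sketch of the augmented-assignment form, on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 0], "B": [2.0, 0.0, 5.0]})

# Same effect as df.loc[df.A == 0, 'B'] = df.loc[df.A == 0, 'B'] / 2
df.loc[df.A == 0, 'B'] /= 2
print(df['B'].tolist())  # [1.0, 0.0, 2.5]
```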
To replace values with NaN you can use the pandas methods mask or where. For example:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [0, 0, 4]})
A B
0 1 0
1 2 0
2 3 4
df['A'].mask(df['B'] == 0, inplace=True) # other=np.nan by default
# df['A'].where(df['B'] != 0, inplace=True)
Result:
A B
0 NaN 0
1 NaN 0
2 3.0 4
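Note that newer pandas versions are phasing out the inplace= argument on mask and where, so a safer sketch of the same idea assigns the result back:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [0, 0, 4]})

# mask replaces values where the condition is True (other=np.nan by default).
df['A'] = df['A'].mask(df['B'] == 0)
# Equivalent with where, which keeps values where the condition is True:
# df['A'] = df['A'].where(df['B'] != 0)
print(df['A'].tolist())  # [nan, nan, 3.0]
```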