Pandas modify column values in place based on boolean array
Question:
I know how to create a new column with apply or np.where based on the values of another column, but a way of selectively changing the values of an existing column is escaping me; I suspect df.ix is involved? Am I close?
For example, here’s a simple dataframe (mine has tens of thousands of rows). I would like to change the value in the ‘flag’ column (let’s say to ‘Blue’) if the name ends with the letter ‘e’:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'name': ['Mick', 'John', 'Christine', 'Stevie', 'Lindsey'],
...                    'flag': ['Purple', 'Red', np.nan, np.nan, np.nan]})[['name', 'flag']]
>>> print(df)
        name    flag
0       Mick  Purple
1       John     Red
2  Christine     NaN
3     Stevie     NaN
4    Lindsey     NaN
[5 rows x 2 columns]
I can make a boolean series from my criteria:
>>> boolean_result = df.name.str.contains('e$')
>>> print(boolean_result)
0    False
1    False
2     True
3     True
4    False
Name: name, dtype: bool
I just need the crucial step to get the following result:
>>> print(result_wanted)
        name    flag
0       Mick  Purple
1       John     Red
2  Christine    Blue
3     Stevie    Blue
4    Lindsey     NaN
Answers:
df.loc[df.name.str.contains('e$'), 'flag'] = 'Blue'
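For a version that can be run end to end, here is a minimal sketch that rebuilds the question's frame and applies the boolean mask through a single .loc call. Selecting the rows and the column in one indexing step sets the values on df itself, whereas chained indexing (selecting the column first, then the rows) may end up assigning to a temporary object in some pandas versions. The variable name ends_with_e is only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question.
df = pd.DataFrame({
    'name': ['Mick', 'John', 'Christine', 'Stevie', 'Lindsey'],
    'flag': ['Purple', 'Red', np.nan, np.nan, np.nan],
})

# Boolean mask: True where the name ends with 'e'.
ends_with_e = df['name'].str.contains('e$')

# One .loc call: rows chosen by the mask, column chosen by label.
df.loc[ends_with_e, 'flag'] = 'Blue'

print(df)
```

Rows where the mask is False (including Lindsey, whose flag is NaN) are left untouched.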
DataFrame.mask(cond, other=nan) does exactly what you want. It replaces values with the value of other where the condition is True.
df['flag'].mask(boolean_result, other='Blue', inplace=True)
inplace=True means to perform the operation in place on the data.
If you want to replace values where the condition is False, consider using DataFrame.where().
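As a hedged illustration of how mask and where relate, the sketch below applies both to the question's data and assigns the result back to the column rather than passing inplace=True. This assumes the default behaviour in which both methods return a new Series. Assigning back is a choice made for this example, since calling an in-place method on a column selected from a frame may not update the frame in every pandas version. The names masked and kept are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the frame and the boolean Series from the question.
df = pd.DataFrame({
    'name': ['Mick', 'John', 'Christine', 'Stevie', 'Lindsey'],
    'flag': ['Purple', 'Red', np.nan, np.nan, np.nan],
})
boolean_result = df['name'].str.contains('e$')

# mask: replace values where the condition is True.
masked = df['flag'].mask(boolean_result, 'Blue')

# where: keep values where the condition is True and replace the rest,
# so the condition is inverted with ~ to get the same outcome.
kept = df['flag'].where(~boolean_result, 'Blue')

# Assign the new Series back to the column.
df['flag'] = masked
print(df)
```

Both calls produce the same column here; which one reads better depends on whether the condition describes the rows to change (mask) or the rows to keep (where).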