vectorize conditional assignment in pandas dataframe
Question:
If I have a dataframe df
with column x
and want to create column y
based on values of x
using this in pseudo code:
if df['x'] < -2 then df['y'] = 1
else if df['x'] > 2 then df['y'] = -1
else df['y'] = 0
How would I achieve this? I assume np.where
is the best way to do this but not sure how to code it correctly.
Answers:
One simple method would be to assign the default value first and then perform 2 loc
calls:
In [66]:
df = pd.DataFrame({'x':[0,-3,5,-1,1]})
df
Out[66]:
x
0 0
1 -3
2 5
3 -1
4 1
In [69]:
df['y'] = 0
df.loc[df['x'] < -2, 'y'] = 1
df.loc[df['x'] > 2, 'y'] = -1
df
Out[69]:
x y
0 0 0
1 -3 1
2 5 -1
3 -1 0
4 1 0
If you wanted to use np.where
then you could do it with a nested np.where
:
In [77]:
df['y'] = np.where(df['x'] < -2 , 1, np.where(df['x'] > 2, -1, 0))
df
Out[77]:
x y
0 0 0
1 -3 1
2 5 -1
3 -1 0
4 1 0
So here we define the first condition as where x is less than -2, return 1, then we have another np.where
which tests the other condition where x is greater than 2 and returns -1, otherwise return 0
timings
In [79]:
%timeit df['y'] = np.where(df['x'] < -2 , 1, np.where(df['x'] > 2, -1, 0))
1000 loops, best of 3: 1.79 ms per loop
In [81]:
%%timeit
df['y'] = 0
df.loc[df['x'] < -2, 'y'] = 1
df.loc[df['x'] > 2, 'y'] = -1
100 loops, best of 3: 3.27 ms per loop
So for this sample dataset the np.where
method is twice as fast
This is a good use case for pd.cut
where you define ranges and based on those ranges
you can assign labels
:
df['y'] = pd.cut(df['x'], [-np.inf, -2, 2, np.inf], labels=[1, 0, -1], right=False)
Output
x y
0 0 0
1 -3 1
2 5 -1
3 -1 0
4 1 0
Use np.select
for multiple conditions
np.select(condlist, choicelist, default=0)
- Return elements in
choicelist
depending on the corresponding condition in condlist
.
- The
default
element is used when all conditions evaluate to False
.
condlist = [
df['x'] < -2,
df['x'] > 2,
]
choicelist = [
1,
-1,
]
df['y'] = np.select(condlist, choicelist, default=0)
np.select
is much more readable than a nested np.where
but just as fast:
df = pd.DataFrame({'x': np.random.randint(-5, 5, size=n)})
You can do it easily using the index and 2 loc
calls:
df = pd.DataFrame({'x':[0,-3,5,-1,1]})
df
x
0 0
1 -3
2 5
3 -1
4 1
df['y'] = 0
idx_1 = df.loc[df['x'] < -2, 'y'].index
idx_2 = df.loc[df['x'] > 2, 'y'].index
df.loc[idx_1, 'y'] = 1
df.loc[idx_2, 'y'] = -1
df
x y
0 0 0
1 -3 1
2 5 -1
3 -1 0
4 1 0
If I have a dataframe df
with column x
and want to create column y
based on values of x
using this in pseudo code:
if df['x'] < -2 then df['y'] = 1
else if df['x'] > 2 then df['y'] = -1
else df['y'] = 0
How would I achieve this? I assume np.where
is the best way to do this but not sure how to code it correctly.
One simple method would be to assign the default value first and then perform 2 loc
calls:
In [66]:
df = pd.DataFrame({'x':[0,-3,5,-1,1]})
df
Out[66]:
x
0 0
1 -3
2 5
3 -1
4 1
In [69]:
df['y'] = 0
df.loc[df['x'] < -2, 'y'] = 1
df.loc[df['x'] > 2, 'y'] = -1
df
Out[69]:
x y
0 0 0
1 -3 1
2 5 -1
3 -1 0
4 1 0
If you wanted to use np.where
then you could do it with a nested np.where
:
In [77]:
df['y'] = np.where(df['x'] < -2 , 1, np.where(df['x'] > 2, -1, 0))
df
Out[77]:
x y
0 0 0
1 -3 1
2 5 -1
3 -1 0
4 1 0
So here we define the first condition as where x is less than -2, return 1, then we have another np.where
which tests the other condition where x is greater than 2 and returns -1, otherwise return 0
timings
In [79]:
%timeit df['y'] = np.where(df['x'] < -2 , 1, np.where(df['x'] > 2, -1, 0))
1000 loops, best of 3: 1.79 ms per loop
In [81]:
%%timeit
df['y'] = 0
df.loc[df['x'] < -2, 'y'] = 1
df.loc[df['x'] > 2, 'y'] = -1
100 loops, best of 3: 3.27 ms per loop
So for this sample dataset the np.where
method is twice as fast
This is a good use case for pd.cut
where you define ranges and based on those ranges
you can assign labels
:
df['y'] = pd.cut(df['x'], [-np.inf, -2, 2, np.inf], labels=[1, 0, -1], right=False)
Output
x y
0 0 0
1 -3 1
2 5 -1
3 -1 0
4 1 0
Use np.select
for multiple conditions
np.select(condlist, choicelist, default=0)
- Return elements in
choicelist
depending on the corresponding condition incondlist
.- The
default
element is used when all conditions evaluate toFalse
.
condlist = [
df['x'] < -2,
df['x'] > 2,
]
choicelist = [
1,
-1,
]
df['y'] = np.select(condlist, choicelist, default=0)
np.select
is much more readable than a nested np.where
but just as fast:
df = pd.DataFrame({'x': np.random.randint(-5, 5, size=n)})
You can do it easily using the index and 2 loc
calls:
df = pd.DataFrame({'x':[0,-3,5,-1,1]})
df
x
0 0
1 -3
2 5
3 -1
4 1
df['y'] = 0
idx_1 = df.loc[df['x'] < -2, 'y'].index
idx_2 = df.loc[df['x'] > 2, 'y'].index
df.loc[idx_1, 'y'] = 1
df.loc[idx_2, 'y'] = -1
df
x y
0 0 0
1 -3 1
2 5 -1
3 -1 0
4 1 0