How do I assign values based on multiple conditions for existing columns?

Question:

I would like to create a new column with a numerical value based on the following conditions:

a. if gender is male & pet1==pet2, points = 5

b. if gender is female & (pet1 is 'cat' or pet1 is 'dog'), points = 5

c. all other combinations, points = 0

    gender    pet1      pet2
0   male      dog       dog
1   male      cat       cat
2   male      dog       cat
3   female    cat       squirrel
4   female    dog       dog
5   female    squirrel  cat
6   squirrel  dog       cat

I would like the end result to be as follows:

    gender    pet1      pet2      points
0   male      dog       dog       5
1   male      cat       cat       5
2   male      dog       cat       0
3   female    cat       squirrel  5
4   female    dog       dog       5
5   female    squirrel  cat       0
6   squirrel  dog       cat       0

How do I accomplish this?

Asked By: Eric

||

Answers:

using apply.

def f(x):
  if x['gender'] == 'male' and x['pet1'] == x['pet2']: return 5
  elif x['gender'] == 'female' and (x['pet1'] == 'cat' or x['pet1'] == 'dog'): return 5
  else: return 0

data['points'] = data.apply(f, axis=1)
Answered By: Ruggero Turra

You can do this using np.where, the conditions use bitwise & and | for and and or with parentheses around the multiple conditions due to operator precedence. So where the condition is true 5 is returned and 0 otherwise:

In [29]:
df['points'] = np.where( ( (df['gender'] == 'male') & (df['pet1'] == df['pet2'] ) ) | ( (df['gender'] == 'female') & (df['pet1'].isin(['cat','dog'] ) ) ), 5, 0)
df

Out[29]:
     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
Answered By: EdChum

The apply method described by @RuggeroTurra takes a lot longer for 500k rows. I ended up using something like

df['result'] = ((df.a == 0) & (df.b != 1)).astype(int) * 2 + 
               ((df.a != 0) & (df.b != 1)).astype(int) * 3 + 
               ((df.a == 0) & (df.b == 1)).astype(int) * 4 + 
               ((df.a != 0) & (df.b == 1)).astype(int) * 5 

where the apply method took 25 seconds and this method above took about 18ms.

Answered By: leonard

numpy.select

This is a perfect case for np.select where we can create a column based on multiple conditions and it’s a readable method when there are more conditions:

conditions = [
    df['gender'].eq('male') & df['pet1'].eq(df['pet2']),
    df['gender'].eq('female') & df['pet1'].isin(['cat', 'dog'])
]

choices = [5,5]

df['points'] = np.select(conditions, choices, default=0)

print(df)
     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
Answered By: Erfan

You can also use the apply function. For example:

def myfunc(gender, pet1, pet2):
    if gender=='male' and pet1==pet2:
        myvalue=5
    elif gender=='female' and (pet1=='cat' or pet1=='dog'):
        myvalue=5
    else:
        myvalue=0
    return myvalue

And then using the apply function by setting axis=1

df['points'] = df.apply(lambda x: myfunc(x['gender'], x['pet1'], x['pet2']), axis=1)

We get:

     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
Answered By: George Pipis

One option is with case_when from pyjanitor; under the hood it uses pd.Series.mask.

The basic idea is a pairing of condition and expected value; you can pass as many pairings as required, followed by a default value and a target column name:

# pip install pyjanitor
import pandas as pd
import janitor
df.case_when(
    # condition, value
    df.gender.eq('male') & df.pet1.eq(df.pet2), 5,
    df.gender.eq('female') & df.pet1.isin(['cat', 'dog']), 5,
    0, # default
    column_name = 'points')

     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0

You could use strings for the conditions, as long as they can be evaluated by pd.eval on the parent dataframe – note that speed wise, this can be slower for small datasets:

df.case_when(
   "gender == 'male' and pet1 == pet2", 5,
   "gender == 'female' and pet2 == ['cat', 'dog']", 5,
   0,
   column_name = 'points')

     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       0
4    female       dog       dog       5
5    female  squirrel       cat       5
6  squirrel       dog       cat       0

Anonymous functions are also possible, which can be handy in chained operations:

df.case_when(
    lambda df: df.gender.eq('male') & df.pet1.eq(df.pet2), 5,
    lambda df: df.gender.eq('female') & df.pet1.isin(['cat', 'dog']), 5,
    0, # default
    column_name = 'points')

     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
Answered By: sammywemmy

Writing the conditions as a string expression and evaluating it using eval() is another method to evaluate the condition and assign values to the column using numpy.where().

# evaluate the condition 
condition = df.eval("gender=='male' and pet1==pet2 or gender=='female' and pet1==['cat','dog']")
# assign values
df['points'] = np.where(condition, 5, 0)

If you have a large dataframe (100k+ rows) and a lot of comparisons to evaluate, this method is probably the fastest pandas method to construct a boolean mask.1

Another advantage of this method over chained & and/or | operators (used in the other vectorized answers here) is better readability (arguably).


1: For a dataframe with 105k rows, if you evaluate 4 conditions where each chain two comparisons, eval() creates a boolean mask substantially faster than chaining bitwise operators.

df = pd.DataFrame([{'gender': 'male', 'pet1': 'dog', 'pet2': 'dog'}, {'gender': 'male', 'pet1': 'cat', 'pet2': 'cat'}, {'gender': 'male', 'pet1': 'dog', 'pet2': 'cat'},{'gender': 'female', 'pet1': 'cat', 'pet2': 'squirrel'},{'gender': 'female', 'pet1': 'dog', 'pet2': 'dog'},{'gender': 'female', 'pet1': 'squirrel', 'pet2': 'cat'},{'gender': 'squirrel', 'pet1': 'dog', 'pet2': 'cat'}]*15_000)

%timeit np.where(df.eval("gender == 'male' and pet1 == pet2 or gender == 'female' and pet1 == ['cat','dog'] or gender == 'female' and pet2 == ['squirrel','dog'] or pet1 == 'cat' and pet2 == 'cat'"), 5, 0)
# 37.9 ms ± 847 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

%timeit np.where( ( (df['gender'] == 'male') & (df['pet1'] == df['pet2'] ) ) | ( (df['gender'] == 'female') & (df['pet1'].isin(['cat','dog'] ) ) ) | ( (df['gender'] == 'female') & (df['pet2'].isin(['squirrel','dog'] ) ) ) | ( (df['pet1'] == 'cat') & (df['pet2'] == 'cat') ), 5, 0)
# 53.5 ms ± 1.38 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)

%timeit np.select([df['gender'].eq('male') & df['pet1'].eq(df['pet2']), df['gender'].eq('female') & df['pet1'].isin(['cat', 'dog']), df['gender'].eq('female') & df['pet2'].isin(['squirrel', 'dog']), df['pet1'].eq('cat') & df['pet2'].eq('cat')], [5,5,5,5], default=0)
# 48.9 ms ± 5.06 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.