Pandas dataframe add a field based on multiple if statements

Question:

I’m quite new to Python and Pandas so this might be an obvious question.

I have a dataframe with ages listed in it. I want to create a new field with an age banding. I can use a lambda to express a single if/else, but I want to use multiple conditions, e.g. if age < 18 then 'under 18', elif age < 40 then 'under 40', else '>40'.

I don’t think I can do this using lambda but am not sure how to do it in a different way. I have this code so far:

import pandas as pd
import numpy as np

d = {'Age' : pd.Series([36., 42., 6., 66., 38.]) }

df = pd.DataFrame(d)

df['Age_Group'] = df['Age'].map(lambda x: '<18' if x < 18 else '>18')

print(df)
Asked By: user3302483


Answers:

The pandas DataFrame provides a nice querying ability.

What you are trying to do can be done simply with:

# Set a default value
df['Age_Group'] = '<40'
# Set Age_Group for all rows where Age is greater than 40
df['Age_Group'][df['Age'] > 40] = '>40'
# Set Age_Group for all rows where Age is greater than 18 and less than 40
df['Age_Group'][(df['Age'] > 18) & (df['Age'] < 40)] = '>18'
# Set Age_Group for all rows where Age is less than 18
df['Age_Group'][df['Age'] < 18] = '<18'

Querying with boolean conditions like this is a powerful feature of the DataFrame: it lets you update exactly the rows that match a condition.

For more complex conditionals, you can combine multiple conditions by wrapping each one in parentheses and joining them with a boolean operator (e.g. '&' or '|').

You can see this at work in the second conditional statement above, the one that sets '>18'.
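As a minimal sketch of how such combined conditions behave, the snippet below builds two boolean masks on the sample ages from the question. The variable names between_18_and_40 and outside_18_to_40 are made up for illustration and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'Age': [36., 42., 6., 66., 38.]})

# Each comparison produces a boolean Series. The parentheses are required
# because & and | bind more tightly than < and >.
between_18_and_40 = (df['Age'] > 18) & (df['Age'] < 40)
outside_18_to_40 = (df['Age'] < 18) | (df['Age'] > 40)

print(between_18_and_40.tolist())  # [True, False, False, False, True]
print(outside_18_to_40.tolist())   # [False, True, True, True, False]
```

Either mask can then be used to select the rows to update.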

Edit:

You can read more about DataFrame indexing and conditionals here:

http://pandas.pydata.org/pandas-docs/dev/indexing.html#index-objects

Edit:

To see how it works:

>>> d = {'Age' : pd.Series([36., 42., 6., 66., 38.]) }
>>> df = pd.DataFrame(d)
>>> df
   Age
0   36
1   42
2    6
3   66
4   38
>>> df['Age_Group'] = '<40'
>>> df['Age_Group'][df['Age'] > 40] = '>40'
>>> df['Age_Group'][(df['Age'] > 18) & (df['Age'] < 40)] = '>18'
>>> df['Age_Group'][df['Age'] < 18] = '<18'
>>> df
   Age Age_Group
0   36       >18
1   42       >40
2    6       <18
3   66       >40
4   38       >18

Edit:

Here is how to do this without chained indexing (using EdChum's approach):

>>> df['Age_Group'] = '<40'
>>> df.loc[df['Age'] > 40, 'Age_Group'] = '>40'
>>> df.loc[(df['Age'] > 18) & (df['Age'] < 40), 'Age_Group'] = '>18'
>>> df.loc[df['Age'] < 18, 'Age_Group'] = '<18'
>>> df
   Age Age_Group
0   36       >18
1   42       >40
2    6       <18
3   66       >40
4   38       >18
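
One thing to watch with strict comparisons such as > 18 and < 40: an age of exactly 18 or 40 matches none of the conditions and keeps the default label. The sketch below illustrates this with the .loc pattern, using a hypothetical frame named edge whose ages sit on and around the boundaries. These values are made up and are not from the question.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the band boundaries.
edge = pd.DataFrame({'Age': [17., 18., 39., 40., 41.]})

edge['Age_Group'] = '<40'  # default for rows no condition matches
edge.loc[edge['Age'] > 40, 'Age_Group'] = '>40'
edge.loc[(edge['Age'] > 18) & (edge['Age'] < 40), 'Age_Group'] = '>18'
edge.loc[edge['Age'] < 18, 'Age_Group'] = '<18'

labels = edge['Age_Group'].tolist()
print(labels)  # ['<18', '<40', '>18', '<40', '>40']
```

If the boundary ages should belong to a specific band, switching one side of each comparison to >= or <= would likely be enough.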
Answered By: Ryan G

You can also use a nested np.where():

df['Age_group'] = np.where(df.Age<18, 'under 18',
                           np.where(df.Age<40,'under 40', '>40'))
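
For reference, here is a self-contained sketch that runs the nested call on the sample ages from the question. It also shows np.select, a different NumPy function that is not part of this answer, as a possible flatter alternative when there are more than two or three bands. The names nested and flat are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [36., 42., 6., 66., 38.]})

# Nested np.where: the inner call only supplies values for rows
# where the outer condition is False.
nested = np.where(df['Age'] < 18, 'under 18',
                  np.where(df['Age'] < 40, 'under 40', '>40'))

# np.select checks the conditions in order and uses the first match;
# rows that match nothing get the default.
flat = np.select(
    [df['Age'] < 18, df['Age'] < 40],
    ['under 18', 'under 40'],
    default='>40',
)

df['Age_group'] = flat
print(df['Age_group'].tolist())
# ['under 40', '>40', 'under 18', '>40', 'under 40']
```

Both approaches rely on the order of the conditions: an age below 18 also satisfies Age < 40, so the narrower test has to come first.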
Answered By: Scarlett Zuo

pyjanitor has a case_when function (currently in dev) for creating or mutating a column based on conditions; under the hood, it is powered by pandas' mask function:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
df.case_when(
      df.Age.between(18, 40, inclusive='neither'), '>18', # condition, value
      df.Age.lt(18), '<18',                               # condition, value
      '>40',                                              # default, if no matches
      column_name = 'Age_group')

    Age Age_group
0  36.0       >18
1  42.0       >40
2   6.0       <18
3  66.0       >40
4  38.0       >18
Answered By: sammywemmy