Pandas – using assign and if-else statement in method chaining

Question:

I come from an R background and I’m trying to replicate the mutate() function from dplyr in pandas.

I have a dataframe that looks like this:

import numpy as np
import pandas as pd

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'age': [42, 52, 36, 24, 73],
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns=['name', 'age', 'preTestScore', 'postTestScore'])

I am now trying to create a new column called age_bracket using assign method as follows:

(df.
    assign(age_bracket= lambda x: "under 25" if x['age'] < 25 else
        ("25-34" if x['age'] < 35 else "35+")))

And this is throwing the following error which I’m not able to understand:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

I am not interested in the following solution:

df['age_bracket'] = np.where(df.age < 25, 'under 25',
     (np.where(df.age < 35, "25-34", "35+")))

This is because I do not want the underlying df to change. I’m trying to get better at method chaining, where I can quickly explore my df in different ways without modifying it.

Any suggestions?

Asked By: Rachit Kinger

||

Answers:

It is possible, but not recommended, because the apply function loops over the values under the hood:

df = (df.
    assign(age_bracket= lambda x: x['age'].apply(lambda y: "under 25" if y < 25 else
        ("25-34" if y < 35 else "35+"))))
print(df)
    name  age  preTestScore  postTestScore age_bracket
0  Jason   42             4             25         35+
1  Molly   52            24             94         35+
2   Tina   36            31             57         35+
3   Jake   24             2             62    under 25
4    Amy   73             3             70         35+
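
For context, the original attempt fails because the lambda passed to assign receives the whole DataFrame, so x['age'] < 25 is a boolean Series with one value per row, while a Python if needs a single True or False. A minimal sketch reproducing this on a cut-down version of the frame from the question (the names mask and error_message are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
                   "age": [42, 52, 36, 24, 73]})

# The comparison is vectorised: it yields one boolean per row, not one bool.
mask = df["age"] < 25
print(mask.tolist())  # [False, False, False, True, False]

# A conditional expression calls bool() on the Series, which pandas refuses
# to reduce to a single value, hence the ValueError.
try:
    "under 25" if mask else "35+"
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Calling apply on the column sidesteps this because the inner lambda then sees one scalar age at a time.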

Or use numpy.select:

df = df.assign(age_bracket= np.select([df.age < 25,df.age < 35], ['under 25', "25-34"], "35+"))
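
If the assign call sits in the middle of a longer chain, it may be safer to wrap np.select in a lambda so the conditions are built from the intermediate frame rather than from the original df. A small sketch, assuming a filter step comes first (the query step and the name result are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
                   "age": [42, 52, 36, 24, 73]})

result = (
    df
    .query("age < 60")  # hypothetical earlier step that drops a row
    .assign(age_bracket=lambda x: np.select(
        [x["age"] < 25, x["age"] < 35],  # conditions are checked in order
        ["under 25", "25-34"],
        default="35+",
    ))
)

print(result["age_bracket"].tolist())  # ['35+', '35+', '35+', 'under 25']
```

Here x is the filtered frame with four rows, so the conditions line up with the rows being assigned to, and df itself is left unchanged.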

But it is better to use cut here:

df = (df.assign(age_bracket= lambda x: pd.cut(x['age'], 
                                              bins=[0, 25, 35, 150],
                                              right=False,  # left-closed bins: [0, 25), [25, 35), [35, 150)
                                              labels=["under 25", "25-34", "35+"])))
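
One detail worth checking with cut is which side of each bin is closed. By default the intervals are closed on the right, so an age of exactly 25 would be labelled "under 25"; passing right=False makes the bins match the < 25 and < 35 conditions from the question. A small sketch with boundary ages (the ages Series is made up for illustration):

```python
import pandas as pd

ages = pd.Series([24, 25, 34, 35])
labels = ["under 25", "25-34", "35+"]

# Default right=True: bins are (0, 25], (25, 35], (35, 150]
right_closed = pd.cut(ages, bins=[0, 25, 35, 150], labels=labels)

# right=False: bins are [0, 25), [25, 35), [35, 150)
left_closed = pd.cut(ages, bins=[0, 25, 35, 150], labels=labels, right=False)

print(right_closed.tolist())  # ['under 25', 'under 25', '25-34', '25-34']
print(left_closed.tolist())   # ['under 25', '25-34', '25-34', '35+']
```

Note that cut returns a categorical column rather than plain strings, which is usually convenient for grouping and ordering.
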
Answered By: jezrael

Why not use assign with np.where?

df.assign(age_bracket = np.where(df.age < 25, 'under 25',
     (np.where(df.age < 35, "25-34", "35+"))))

This returns a copy of the original dataframe with the new column added.

But I agree with @jezrael that pd.cut is better, in my opinion.

Output:

    name  age  preTestScore  postTestScore age_bracket
0  Jason   42             4             25         35+
1  Molly   52            24             94         35+
2   Tina   36            31             57         35+
3   Jake   24             2             62    under 25
4    Amy   73             3             70         35+
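
The point that assign returns a new dataframe and leaves the original alone can be checked directly. A minimal sketch on a cut-down version of the frame (the name with_bracket is only for illustration, and the lambda form is used so it also works mid-chain):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
                   "age": [42, 52, 36, 24, 73]})

with_bracket = df.assign(
    age_bracket=lambda x: np.where(x["age"] < 25, "under 25",
                                   np.where(x["age"] < 35, "25-34", "35+"))
)

print(list(df.columns))            # ['name', 'age']
print(list(with_bracket.columns))  # ['name', 'age', 'age_bracket']
```
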
Answered By: Scott Boston

With datar, it is easy to use the same syntax in Python as you did in R:

>>> from datar.all import f, tibble, mutate, if_else
>>> 
>>> data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
...         'age': [42, 52, 36, 24, 73], 
...         'preTestScore': [4, 24, 31, 2, 3],
...         'postTestScore': [25, 94, 57, 62, 70]}
>>> 
>>> df = tibble(**data)
>>> df >> mutate(age_bracket=if_else(
...   f.age < 25, 
...   "under 25",
...   if_else(f.age < 35, "25-34", "35+")
... ))
      name     age  preTestScore  postTestScore age_bracket
  <object> <int64>       <int64>        <int64>    <object>
0    Jason      42             4             25         35+
1    Molly      52            24             94         35+
2     Tina      36            31             57         35+
3     Jake      24             2             62    under 25
4      Amy      73             3             70         35+

Disclaimer: I am the author of the datar package.

Answered By: Panwen Wang

pyjanitor has a case_when implementation in dev that could be helpful in this case. The idea is inspired by if_else in pydatatable and fcase in R’s data.table; under the hood, it uses pd.Series.mask:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor as jn

df.case_when(
   df.age.lt(25), 'under 25',  # 1st condition, result
   df.age.lt(35), '25-34',    # 2nd condition, result
   '35+',                     # default
   column_name = 'age_bracket')

    name  age  preTestScore  postTestScore age_bracket
0  Jason   42             4             25         35+
1  Molly   52            24             94         35+
2   Tina   36            31             57         35+
3   Jake   24             2             62    under 25
4    Amy   73             3             70         35+

For this use case though, since you are partitioning into categories, the pd.cut solution by @jezrael is more efficient.

Answered By: sammywemmy