Using an `if` statement inside a Pandas DataFrame's `assign` method

Question:

Intro and reproducible code snippet

I’m having a hard time performing an operation on a few columns that requires the checking of a condition using an if/else statement.

More specifically, I’m trying to perform this check within the confines of the assign method of a Pandas DataFrame. Here is an example of what I’m trying to do:

# Importing Pandas
import pandas as pd

# Creating synthetic data
my_df = pd.DataFrame({'col1':[1,2,3,4,5,6,7,8,9,10],
                      'col2':[11,22,33,44,55,66,77,88,99,1010]})

# Creating a separate output DataFrame that doesn't overwrite 
# the original input DataFrame
out_df = my_df.assign(
    # Successfully creating a new column called `col3` using a lambda function
    col3=lambda row: row['col1'] + row['col2'],

    # Using a new lambda function to perform an operation on the newly 
    # generated column. 
    bleep_bloop=lambda row: 'bleep' if (row['col3']%8 == 0) else 'bloop')

The code above yields a ValueError:

ValueError: The truth value of a Series is ambiguous

When trying to investigate the error, I found this SO thread. It seems that lambda functions don’t always work very nicely with conditional logic in a DataFrame, mostly due to the DataFrame’s attempt to deal with things as Series.

A few dirty workarounds

Use apply

A dirty workaround would be to make col3 using the assign method as indicated above, but then create the bleep_bloop column using an apply method instead:

out_sr = (my_df.assign(
    col3=lambda row: row['col1'] + row['col2'])
    .apply(lambda row: 'bleep' if (row['col3']%8 == 0) 
                               else 'bloop', axis=1))

The problem here is that the code above returns only a Series with the results of the bleep_bloop column instead of a new DataFrame with both col3 and bleep_bloop.

On the fly vs. multiple commands

Yet another approach would be to break one command into two:

out_df_2 = (my_df.assign(col3=lambda row: row['col1'] + row['col2']))
out_df_2['bleep_bloop'] = out_df_2.apply(lambda row: 'bleep' if (row['col3']%8 == 0) 
                               else 'bloop', axis=1)

This also works, but I’d really like to stick to the on-the-fly approach where I do everything in one chained command, if possible.

Back to the main question

Given that the workarounds I showed above are messy and don’t really get the job done like I need, is there any other way I can create a new column that’s based on using a conditional if/else statement?

The example I gave here is pretty simple, but consider that the real world application would likely involve applying custom-made functions (e.g.: out_df=my_df.assign(new_col=lambda row: my_func(row)), where my_func is some complex function that uses several other columns from the same row as inputs).

Asked By: Felipe D.


Answers:

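The mistake is assuming that the lambda acts on individual rows. The callable passed to assign receives the whole DataFrame, so row['col3']%8 == 0 is a boolean Series, and a plain if cannot reduce that to a single True or False. You need to use vectorized functions instead: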

import numpy as np

out_df = my_df.assign(
    col3=lambda d: d['col1'] + d['col2'],
    bleep_bloop=lambda d: np.where(d['col3']%8, 'bloop', 'bleep')
)

print(out_df)

Output:

   col1  col2  col3 bleep_bloop
0     1    11    12       bloop
1     2    22    24       bleep
2     3    33    36       bloop
3     4    44    48       bleep
4     5    55    60       bloop
5     6    66    72       bleep
6     7    77    84       bloop
7     8    88    96       bleep
8     9    99   108       bloop
9    10  1010  1020       bloop
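
To make the point about whole columns concrete, here is a small sketch that records what the callable passed to assign actually receives, and that writes the condition explicitly instead of relying on the truthiness of the remainder. The helper name show_type and the received list are made up for this illustration; the data is the same my_df as in the question.

```python
import numpy as np
import pandas as pd

my_df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                      'col2': [11, 22, 33, 44, 55, 66, 77, 88, 99, 1010]})

received = []  # records the type of object each callable is given

def show_type(d):
    # Hypothetical helper: remember what `assign` passed in, then build col3.
    received.append(type(d).__name__)
    return d['col1'] + d['col2']

out_df = my_df.assign(
    col3=show_type,
    # Explicit comparison: 'bleep' where col3 is divisible by 8, else 'bloop'.
    bleep_bloop=lambda d: np.where(d['col3'] % 8 == 0, 'bleep', 'bloop'),
)

print(received)                        # the callable sees a DataFrame, not a row
print(out_df['bleep_bloop'].tolist())
```

Writing the comparison as `% 8 == 0` keeps the 'bleep'/'bloop' order the same as in the question's if/else, which may be easier to read than the swapped order needed when the raw remainder is used as the condition.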

Answered By: mozway

For more than two conditions, you can use np.select:

import numpy as np

out_df = my_df.assign(
    col3=lambda df_: df_['col1'] + df_['col2'],
    bleep_bloop=lambda df_: np.select(
        condlist=[df_['col3'] % 8 == 0,
                  df_['col3'] % 8 == 1,
                  df_['col3'] > 100],
        choicelist=['bleep', 'bloop', 'bliip'],
        default='bluup'))

The good thing about np.select is that, like np.where, it is vectorized (and therefore fast), and you can pass as many conditions as you want. The conditions are checked in order, and the first one that matches determines the result.
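
One detail worth seeing in action is what happens when a value satisfies more than one condition. The sketch below uses a small Series whose values were picked only for this illustration (they are not from the question) and the same three conditions as above.

```python
import numpy as np
import pandas as pd

# Illustrative values only: 104 satisfies both the first and the third condition.
s = pd.Series([104, 9, 108, 12])

labels = np.select(
    condlist=[s % 8 == 0,   # checked first
              s % 8 == 1,   # checked second
              s > 100],     # checked last
    choicelist=['bleep', 'bloop', 'bliip'],
    default='bluup',        # used when no condition matches
)

print(labels.tolist())  # ['bleep', 'bloop', 'bliip', 'bluup']
```

Here 104 is both divisible by 8 and greater than 100, and it gets 'bleep' because the first matching condition wins. So the order of condlist matters whenever conditions can overlap.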

Answered By: galk32

Since your final column will need complex logic, as you mentioned, it makes sense to write a separate function for it and apply it to the rows.

def my_func(x):
    if (x['col1'] + x['col2']) % 8 == 0:
        return 'bleep'
    else:
        return 'bloop'

my_df['bleep_bloop'] = my_df.apply(lambda x: my_func(x), axis=1)

When you pass x to the function, you are in fact passing each row, so you can use any of the column values inside your function, such as x['col1'] and so on. This way you can make the function as complex as you need. Note that axis=1 is required here to pass the rows.

I did not include the creation of col3, since this is only meant as a sample.
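
If you want to keep everything in one chained command, as the question asks, a row-wise function can also be called from inside assign. This is only a sketch of one way to do it, reusing the question's data and a variant of my_func that reads col3: the lambda still receives the whole DataFrame, and it is the apply call with axis=1 that hands rows to my_func.

```python
import pandas as pd

my_df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                      'col2': [11, 22, 33, 44, 55, 66, 77, 88, 99, 1010]})

def my_func(row):
    # Receives a single row (a Series) because of axis=1 below.
    if row['col3'] % 8 == 0:
        return 'bleep'
    return 'bloop'

out_df = my_df.assign(
    col3=lambda d: d['col1'] + d['col2'],
    # d is the whole DataFrame, already containing col3; apply goes row by row.
    bleep_bloop=lambda d: d.apply(my_func, axis=1),
)

print(out_df)
```

This returns a new DataFrame with both col3 and bleep_bloop and leaves my_df untouched. Row-wise apply is generally slower than np.where or np.select on large data, so it is probably best kept for logic that cannot easily be vectorized.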

Answered By: Yashar