Using an `if` statement inside a Pandas DataFrame's `assign` method
Question:
Intro and reproducible code snippet
I’m having a hard time performing an operation on a few columns that requires checking a condition with an if/else statement.
More specifically, I’m trying to perform this check within the assign method of a Pandas DataFrame. Here is an example of what I’m trying to do:
# Importing Pandas
import pandas as pd
# Creating synthetic data
my_df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                      'col2': [11, 22, 33, 44, 55, 66, 77, 88, 99, 1010]})
# Creating a separate output DataFrame that doesn't overwrite
# the original input DataFrame
out_df = my_df.assign(
    # Successfully creating a new column called `col3` using a lambda function
    col3=lambda row: row['col1'] + row['col2'],
    # Using a new lambda function to perform an operation on the newly
    # generated column.
    bleep_bloop=lambda row: 'bleep' if (row['col3']%8 == 0) else 'bloop')
The code above yields a ValueError:
ValueError: The truth value of a Series is ambiguous
When trying to investigate the error, I found this SO thread. It seems that lambda functions don’t always work very nicely with conditional logic in a DataFrame, mostly due to the DataFrame’s attempt to deal with things as Series.
A few dirty workarounds
Use apply
A dirty workaround would be to make col3 using the assign method as indicated above, but then create the bleep_bloop column using an apply method instead:
out_sr = (my_df.assign(
              col3=lambda row: row['col1'] + row['col2'])
          .apply(lambda row: 'bleep' if (row['col3']%8 == 0)
                 else 'bloop', axis=1))
The problem here is that the code above returns only a Series with the results of the bleep_bloop column instead of a new DataFrame with both col3 and bleep_bloop.
On the fly vs. multiple commands
Yet another approach would be to break one command into two:
out_df_2 = (my_df.assign(col3=lambda row: row['col1'] + row['col2']))
out_df_2['bleep_bloop'] = out_df_2.apply(lambda row: 'bleep' if (row['col3']%8 == 0)
                                         else 'bloop', axis=1)
This also works, but I’d really like to stick to the on-the-fly approach where I do everything in one chained command, if possible.
Back to the main question
Given that the workarounds I showed above are messy and don’t really get the job done like I need, is there any other way I can create a new column that’s based on using a conditional if/else statement?
The example I gave here is pretty simple, but consider that the real-world application would likely involve applying custom-made functions (e.g., out_df=my_df.assign(new_col=lambda row: my_func(row)), where my_func is some complex function that uses several other columns from the same row as inputs).
Answers:
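Your mistake is assuming that the lambda acts on individual rows. The callable passed to assign receives the whole DataFrame, so row['col3'] is a full column (a Series), and a plain if/else cannot reduce it to a single True or False. You need to use vectorized functions: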
import numpy as np
out_df = my_df.assign(
    col3=lambda d: d['col1'] + d['col2'],
    # A non-zero remainder is truthy, so those rows get 'bloop'
    bleep_bloop=lambda d: np.where(d['col3']%8, 'bloop', 'bleep')
)
print(out_df)
Output:
   col1  col2  col3 bleep_bloop
0     1    11    12       bloop
1     2    22    24       bleep
2     3    33    36       bloop
3     4    44    48       bleep
4     5    55    60       bloop
5     6    66    72       bleep
6     7    77    84       bloop
7     8    88    96       bleep
8     9    99   108       bloop
9    10  1010  1020       bloop
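To make the cause of the ValueError concrete, here is a minimal sketch on a shortened four-row version of the data. The helper add_cols and the list seen are hypothetical names used only for illustration. It records what assign hands to the callable, then shows that a plain if/else fails on the resulting boolean Series while np.where handles it element by element.

```python
import numpy as np
import pandas as pd

# Shortened version of the question's data (assumption: four rows are enough to illustrate)
my_df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [11, 22, 33, 44]})

seen = []  # hypothetical helper list: records the type passed to the callable

def add_cols(d):
    # `d` is whatever assign passes in; we log its type before using it
    seen.append(type(d).__name__)
    return d['col1'] + d['col2']

out_df = my_df.assign(col3=add_cols)

# The condition is a whole boolean Series, not a single True/False
mask = out_df['col3'] % 8 == 0

try:
    'bleep' if mask else 'bloop'  # Python asks the Series for one truth value
    raised = False
except ValueError:
    raised = True

# The vectorized equivalent evaluates the condition per element
labels = np.where(mask, 'bleep', 'bloop')

print(seen)    # type name(s) received by the callable
print(raised)  # whether the plain if/else raised ValueError
print(labels)
```

The callable is called once with the full DataFrame rather than once per row, which is why the condition has to be expressed with element-wise tools.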
Or for more than 2 conditions you can use np.select:
import numpy as np
out_df = (my_df.assign(
    col3=lambda df_: df_['col1'] + df_['col2'],
    bleep_bloop=lambda df_: np.select(
        condlist=[df_['col3']%8 == 0,
                  df_['col3']%8 == 1,
                  df_['col3'] > 100],
        choicelist=['bleep',
                    'bloop',
                    'bliip'],
        default='bluup')))
The good thing about np.select is that, like np.where, it is vectorized (and therefore faster than a row-wise apply), and you can pass as many conditions as you want.
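One detail worth knowing when conditions overlap: np.select checks them in order, and the first condition that is true for an element decides its value. The sketch below uses made-up values (not the question's data) chosen so that one element satisfies two conditions at once.

```python
import numpy as np
import pandas as pd

# Illustrative values only: 104 is both divisible by 8 and greater than 100
col3 = pd.Series([8, 9, 104, 102, 3])

labels = np.select(
    condlist=[col3 % 8 == 0,   # checked first
              col3 % 8 == 1,   # checked second
              col3 > 100],     # checked last
    choicelist=['bleep', 'bloop', 'bliip'],
    default='bluup',           # used when no condition matches
)

print(list(labels))
```

Here 104 ends up as 'bleep' rather than 'bliip', because the divisibility check comes earlier in condlist. If a different priority is needed, reorder the conditions.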
Since you will need complex logic for your final column, as you mentioned, it makes sense to write a separate function for it and apply it to the rows.
def my_func(x):
    if (x['col1'] + x['col2']) % 8 == 0:
        return 'bleep'
    else:
        return 'bloop'
my_df['bleep_bloop'] = my_df.apply(lambda x: my_func(x), axis=1)
When you pass x to the function, you are in fact passing each row, so you can use any of the column values inside your function, such as x['col1']. This way you can make the function as complex as you need. Note that axis=1 is required here so that rows, not columns, are passed.
I left out the creation of col3 to keep the sample short.
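If the goal is still a single chained command, one possible way to combine the two ideas is to call apply inside the lambda passed to assign. This is a sketch on a shortened version of the data, not something taken from the answers above. It relies on assign evaluating keyword arguments in order, so the second lambda can already see col3.

```python
import pandas as pd

# Shortened version of the question's data (assumption: four rows for brevity)
my_df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [11, 22, 33, 44]})

def my_func(row):
    # `row` is one row of the DataFrame (a Series indexed by column name)
    return 'bleep' if row['col3'] % 8 == 0 else 'bloop'

out_df = my_df.assign(
    # The lambda receives the whole DataFrame and returns a new column
    col3=lambda d: d['col1'] + d['col2'],
    # Row-wise apply runs on the intermediate DataFrame, which already has col3
    bleep_bloop=lambda d: d.apply(my_func, axis=1),
)

print(out_df)
```

This keeps my_df untouched and returns a DataFrame with both new columns. The trade-off is speed: a row-wise apply calls my_func once per row, so the vectorized np.where or np.select versions are preferable when the logic can be written that way.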