How to conditionally concat 2 columns in Python Pandas Dataframe

Question:

I have the following dataframe:

d = {'col1': [1, "Avoid", 3, "Avoid"], 'col2': ["AA", "BB", "Avoid", "Avoid"]}
df = pd.DataFrame(data=d)
df

    col1    col2
0   1       AA
1   Avoid   BB
2   3       Avoid
3   Avoid   Avoid

I have to conditionally concat col1 and col2 into col3. Conditions:

  • Only concat 2 columns as long as none of them is Avoid.
  • If any of col1 and col2 is Avoid, col3 will be equal to Avoid as well.
  • When performing concatination, " & " needs to be added between column values in col3. For instance, first row of col3 will be "1 & AA".

The end result is supposed to look as the following:

    col1    col2    col3
0   1       AA      1 & AA
1   Avoid   BB      Avoid
2   3       Avoid   Avoid
3   Avoid   Avoid   Avoid

How can I do this without dealing with for loops?

Asked By: edn

||

Answers:

try this:

df["col3"]=df.apply(lambda x:"Avoid" if x["col1"]=="Avoid" or x["col2"]=="Avoid" else f"{x['col1']} & {x['col2']}",axis=1)
Answered By: to_data

Is not an efficient way to work with pandas, but if you can’t change the data structure these are solutions:

Solution 1:

def custom_merge(cols):
    if cols["col1"]=="Avoid" or cols["col2"]=="Avoid":
        return "Avoid"
    else:
        return f"{cols['col1']} & {cols['col2']}"
    
df['col3'] = df.apply(custom_merge, axis=1)

Solution 2:

df['col3'] = (df["col1"].astype(str) + " & " + df["col2"].astype(str)).apply(lambda x: "Avoid" if 'Avoid' in x else x)

Both solutions results in the following:

    col1    col2    col3
0   1       AA      1 & AA
1   Avoid   BB      Avoid
2   3       Avoid   Avoid
3   Avoid   Avoid   Avoid

Execution Time comparison

In this section I will count the execution times of the proposed solutions.
@mozway proposed 2 very tricky solutions in his answer, which I will call Solution 3a and Solution 3b. Another interesting solution is @sammywemmy’s solution that uses a list comprehension and then adds the list to the dataframe, I will call this solution solution 4

The instances of the experiments will have the following structure:

import pandas as pd
n = 100000
d = {'col1': [1, "Avoid", 3, "Avoid"] * n, 'col2': ["AA", "BB", "Avoid", "Avoid"] * n}
df = pd.DataFrame(data=d)

Execution time:

  • Solution 1:

    %timeit 3.56 s ± 71.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
  • Solution 2:

    %timeit 140 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
  • Solution 3a:

    %timeit 3.44 s ± 77.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
  • Solution 3b:

    %timeit 893 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
  • Solution 4:

    %timeit 191 ms ± 5.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

Solution 1 and solution 3a have similar execution times on the instance under consideration. Solution 3b runs 4 times faster than solutions 1 and 3a. The fastest solution is solution 2 which goes about 7 times faster than solution 2 and about 25 times faster than solutions 1 and 3a. This is because solution 2 takes advantage of pandas vectorization. Solution 4 has a similar runtime as solution 2, taking advantage of list comprehension for the merge operation (without using pandas).

TIPS:
If you can change the data format, the advice is to format the data so that you can use the native pandas functions to do the join operations or if you can’t change the data format and can do without pandas you might have a slight speed up over solution 2 using dictionaries or lists, doing the merge operation using list comphrehnsion.

Answered By: Massifox

Use boolean operations, this enables you to use an arbitrary number of columns:

# is any value in the row "Avoid"?
m = df.eq('Avoid').any(1)

# concatenate all columns unless there was a "Avoid"
df['col3'] = df.astype(str).agg(' & '.join, axis=1).mask(m, 'Avoid')

Alternative that should be faster if you have many rows and few with "Avoid":

m = df.ne('Avoid').all(1)

df.loc[m, 'col3'] = df[m].astype(str).agg(' & '.join, axis=1)

df['col3'] = df['col3'].fillna('Avoid')

output:

    col1   col2    col3
0      1     AA  1 & AA
1  Avoid     BB   Avoid
2      3  Avoid   Avoid
3  Avoid  Avoid   Avoid
Answered By: mozway
df["col3"] = df["col1"] + " & " + df["col2"]
df["col3"] = df["col3"].apply(lambda x: "Avoid" if x.contains("Avoid") else x)
Answered By: the_ordinary_guy

Run a list comprehension on the strings in plain python:

out = [f"{l}&{r}" 
        if 'Avoid' not in {l, r} 
        else 'Avoid' 
        for l, r in zip(df.col1, df.col2)]
df.assign(col3 = out)
    col1   col2   col3
0      1     AA   1&AA
1  Avoid     BB  Avoid
2      3  Avoid  Avoid
3  Avoid  Avoid  Avoid
Answered By: sammywemmy