Coalesce values from 2 columns into a single column in a pandas dataframe

Question:

I’m looking for a method that behaves similarly to coalesce in T-SQL. I have 2 columns (column A and B) that are sparsely populated in a pandas dataframe. I’d like to create a new column using the following rules:

  1. If the value in column A is not null, use that value for the new column C
  2. If the value in column A is null, use the value in column B for the new column C

Like I mentioned, this can be accomplished in MS SQL Server via the coalesce function. I haven’t found a good pythonic method for this; does one exist?

Asked By: Sevyns

||

Answers:

use combine_first():

In [16]: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=list('ab'))

In [17]: df.loc[::2, 'a'] = np.nan

In [18]: df
Out[18]:
     a  b
0  NaN  0
1  5.0  5
2  NaN  8
3  2.0  8
4  NaN  3
5  9.0  4
6  NaN  7
7  2.0  0
8  NaN  6
9  2.0  5

In [19]: df['c'] = df.a.combine_first(df.b)

In [20]: df
Out[20]:
     a  b    c
0  NaN  0  0.0
1  5.0  5  5.0
2  NaN  8  8.0
3  2.0  8  2.0
4  NaN  3  3.0
5  9.0  4  9.0
6  NaN  7  7.0
7  2.0  0  2.0
8  NaN  6  6.0
9  2.0  5  2.0

Try this also.. easier to remember:

df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )

This is slighty faster: df['c'] = np.where(df["a"].isnull() == True, df["b"], df["a"] )

%timeit df['d'] = df.a.combine_first(df.b)
1000 loops, best of 3: 472 µs per loop


%timeit  df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )
1000 loops, best of 3: 291 µs per loop
Answered By: Merlin

combine_first is the most straightforward option. There are a couple of others which I outline below. I’m going to outline a few more solutions, some applicable to different cases.

Case #1: Non-mutually Exclusive NaNs

Not all rows have NaNs, and these NaNs are not mutually exclusive between columns.

df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, np.nan, 5.0, 7.0, np.nan],
    'b': [5.0, 3.0, np.nan, 4.0, np.nan, 6.0, 7.0]})      
df

     a    b
0  1.0  5.0
1  2.0  3.0
2  3.0  NaN
3  NaN  4.0
4  5.0  NaN
5  7.0  6.0
6  NaN  7.0

Let’s combine first on a.

Series.mask

df['a'].mask(pd.isnull, df['b'])
# df['a'].mask(df['a'].isnull(), df['b'])
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    7.0
6    7.0
Name: a, dtype: float64

Series.where

df['a'].where(pd.notnull, df['b'])

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    7.0
6    7.0
Name: a, dtype: float64

You can use similar syntax using np.where.

Alternatively, to combine first on b, switch the conditions around.


Case #2: Mutually Exclusive Positioned NaNs

All rows have NaNs which are mutually exclusive between columns.

df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, np.nan, 5.0, np.nan, np.nan],
    'b': [np.nan, np.nan, np.nan, 4.0, np.nan, 6.0, 7.0]})
df

     a    b
0  1.0  NaN
1  2.0  NaN
2  3.0  NaN
3  NaN  4.0
4  5.0  NaN
5  NaN  6.0
6  NaN  7.0

Series.update

This method works in-place, modifying the original DataFrame. This is an efficient option for this use case.

df['b'].update(df['a'])
# Or, to update "a" in-place,
# df['a'].update(df['b'])
df

     a    b
0  1.0  1.0
1  2.0  2.0
2  3.0  3.0
3  NaN  4.0
4  5.0  5.0
5  NaN  6.0
6  NaN  7.0

Series.add

df['a'].add(df['b'], fill_value=0)

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    7.0
dtype: float64

DataFrame.fillna + DataFrame.sum

df.fillna(0).sum(1)

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    7.0
dtype: float64
Answered By: cs95

I encountered this problem with but wanted to coalesce multiple columns, picking the first non-null from several columns. I found the following helpful:

Build dummy data

import pandas as pd
df = pd.DataFrame({'a1': [None, 2, 3, None],
                   'a2': [2, None, 4, None],
                   'a3': [4, 5, None, None],
                   'a4': [None, None, None, None],
                   'b1': [9, 9, 9, 999]})

df
    a1   a2   a3    a4   b1
0  NaN  2.0  4.0  None    9
1  2.0  NaN  5.0  None    9
2  3.0  4.0  NaN  None    9
3  NaN  NaN  NaN  None  999

coalesce a1 a2, a3 into a new column A

def get_first_non_null(dfrow, columns_to_search):
    for c in columns_to_search:
        if pd.notnull(dfrow[c]):
            return dfrow[c]
    return None

# sample usage:
cols_to_search = ['a1', 'a2', 'a3']
df['A'] = df.apply(lambda x: get_first_non_null(x, cols_to_search), axis=1)

print(df)
    a1   a2   a3    a4   b1    A
0  NaN  2.0  4.0  None    9  2.0
1  2.0  NaN  5.0  None    9  2.0
2  3.0  4.0  NaN  None    9  3.0
3  NaN  NaN  NaN  None  999  NaN
Answered By: David Smith

I’m thinking a solution like this,

def coalesce(s: pd.Series, *series: List[pd.Series]):
    """coalesce the column information like a SQL coalesce."""
    for other in series:
        s = s.mask(pd.isnull, other)        
    return s

because given a DataFrame with columns with ['a', 'b', 'c'], you can use it like a SQL coalesce,

df['d'] = coalesce(df.a, df.b, df.c)
Answered By: Christian DiMare

For a more general case, where there are no NaNs but you want the same behavior:

Merge 'left', but override 'right' values where possible

Answered By: Cilantro Ditrek

Good code, put you have a typo for python 3, correct one looks like this

    """coalesce the column information like a SQL coalesce."""
    for other in series:
        s = s.mask(pd.isnull, other)        
    return s
Answered By: Stefan Voshage

Coalesce for multiple columns with DataFrame.bfill

All these methods work for two columns and are fine with maybe three columns, but they all require method chaining if you have n columns when n > 2:

example dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1':[np.NaN, 2, 4, 5, np.NaN],
                   'col2':[np.NaN, 5, 1, 0, np.NaN],
                   'col3':[2, np.NaN, 9, 1, np.NaN],
                   'col4':[np.NaN, 10, 11, 4, 8]})

print(df)

   col1  col2  col3  col4
0   NaN   NaN   2.0   NaN
1   2.0   5.0   NaN  10.0
2   4.0   1.0   9.0  11.0
3   5.0   0.0   1.0   4.0
4   NaN   NaN   NaN   8.0

Using DataFrame.bfill over the columns axis (axis=1) we can get the values in a generalized way even for a big n amount of columns

Plus, this would also work for string type columns !!

df['coalesce'] = df.bfill(axis=1).iloc[:, 0]

   col1  col2  col3  col4  coalesce
0   NaN   NaN   2.0   NaN       2.0
1   2.0   5.0   NaN  10.0       2.0
2   4.0   1.0   9.0  11.0       4.0
3   5.0   0.0   1.0   4.0       5.0
4   NaN   NaN   NaN   8.0       8.0

Using the Series.combine_first (accepted answer), it can get quite cumbersome and would eventually be undoable when amount of columns grow

df['coalesce'] = (
    df['col1'].combine_first(df['col2'])
        .combine_first(df['col3'])
        .combine_first(df['col4'])
)

   col1  col2  col3  col4  coalesce
0   NaN   NaN   2.0   NaN       2.0
1   2.0   5.0   NaN  10.0       2.0
2   4.0   1.0   9.0  11.0       4.0
3   5.0   0.0   1.0   4.0       5.0
4   NaN   NaN   NaN   8.0       8.0
Answered By: Erfan

Consider using DuckDB for efficient SQL on Pandas. It’s performant, simple, and feature-packed. https://duckdb.org/2021/05/14/sql-on-pandas.html

Sample Dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A':[1,np.NaN, 3, 4, 5],
                   'B':[np.NaN, 2, 3, 4, np.NaN]})

Coalesce using DuckDB:

import duckdb
out_df = duckdb.query("""SELECT A,B,coalesce(A,B) as C from df""").to_df()
print(out_df)

Output:

     A    B    c
0  1.0  NaN  1.0
1  NaN  2.0  2.0
2  3.0  3.0  3.0
3  4.0  4.0  4.0
4  5.0  NaN  5.0
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.