How to combine two columns if one is empty

Question:

I have a table:

A B C
x 1 NA
y NA 4
z 2 NA
p NA 5
t 6 7

I want to create a new column D which should combine columns B and C if one of the columns is empty (NA):

A B C D
x 1 NA 1
y NA 4 4
z 2 NA 2
p NA 5 5
t 6 7 error

In case both columns contain a value, it should return the text ‘error’ inside the cell.

Asked By: honeymoon


Answers:

There are several ways to achieve this.

Using fillna and mask

df['D'] = df['B'].fillna(df['C']).mask(df['B'].notna()&df['C'].notna(), 'error')

Or numpy.select:

m1 = df['B'].notna()
m2 = df['C'].notna()
df['D'] = np.select([m1&m2, m1], ['error', df['B']], df['C'])

Output:

   A    B    C      D
0  x  1.0  NaN    1.0
1  y  NaN  4.0    4.0
2  z  2.0  NaN    2.0
3  p  NaN  5.0    5.0
4  t  6.0  7.0  error
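
For anyone who wants to run the fillna/mask variant end to end, here is a minimal sketch. It assumes the question's table is built with np.nan as the missing value, and it adds an explicit astype(object) before mask (not part of the one-liner above) so that the change of dtype caused by the 'error' string is visible rather than implicit.

```python
import numpy as np
import pandas as pd

# The question's table, with np.nan standing in for NA (an assumption).
df = pd.DataFrame({
    "A": list("xyzpt"),
    "B": [1, np.nan, 2, np.nan, 6],
    "C": [np.nan, 4, np.nan, 5, 7],
})

# Rows where both B and C hold a value.
both = df["B"].notna() & df["C"].notna()

# Take B, fall back to C where B is missing, then flag the conflicting rows.
# The cast to object is an added step: a column holding numbers and the
# string 'error' cannot stay float64 anyway.
df["D"] = df["B"].fillna(df["C"]).astype(object).mask(both, "error")

print(df)
```

Column D ends up with object dtype, which is the price of mixing numbers with text.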
Answered By: mozway

You could first compute a mask for the rows where both values are present, then fill the NA values of one column (say, column B) with the values from column C. Finally, use the mask from the first step to assign NA wherever both columns had a value.

error_mask = df['B'].notna() & df['C'].notna()
df['D'] = df['B'].fillna(df['C'])
df.loc[error_mask, 'D'] = pd.NA
df

   A     B     C     D
0  x     1  <NA>     1
1  y  <NA>     4     4
2  z     2  <NA>     2
3  p     3     5  <NA>


OR
df['D'] = df['D'].astype(str)
df.loc[error_mask, 'D'] = 'error'

I would advise against assigning the string 'error' where both values are present, since that turns the whole D column into object dtype and it stops behaving like a numeric column.
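
The <NA> values in the output above suggest a nullable integer column. Here is a minimal sketch of the NA-instead-of-'error' variant under that assumption: the "Int64" dtype is my guess at what was used, and the data is the question's table rather than the four-row frame shown in the output.

```python
import pandas as pd

# Question's table with pandas' nullable integer dtype (an assumption).
df = pd.DataFrame({
    "A": list("xyzpt"),
    "B": pd.array([1, None, 2, None, 6], dtype="Int64"),
    "C": pd.array([None, 4, None, 5, 7], dtype="Int64"),
})

# Rows where both columns hold a value.
error_mask = df["B"].notna() & df["C"].notna()

# Fill the gaps in B from C, then blank out the conflicting rows.
df["D"] = df["B"].fillna(df["C"])
df.loc[error_mask, "D"] = pd.NA

print(df)
print(df["D"].dtype)
```

D stays an integer column here, so sums and comparisons keep working, and the conflicting rows can still be found with df['D'].isna().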

Answered By: Grinjero

Adding to the previous answers, you can also address this with a series of .apply() calls paired with lambda functions.

Consider the dataframe that you presented, with np.nan as the NA values:

df = pd.DataFrame({
    'B':[1, np.nan, 2, np.nan, 6], 
    'C':[np.nan, 4, np.nan, 5, 7]})

First generate a list of the elements from the series in question:

df['D'] = df.apply(lambda x: list(x), axis=1) 

This gives you a pd.Series whose elements are lists of values, e.g. [1.0, nan] for the first row. Next, remove all np.nan elements by using the fact that np.nan != np.nan, so a NaN never compares equal to itself (see also the answers to: How can I remove Nan from list Python/NumPy).

df['E'] = df['D'].apply(lambda x: [i for i in x if i == i])
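
To see why the i == i test drops NaN, here is a small standalone check. The list below is made up for illustration.

```python
import numpy as np

# NaN is the only float value that is not equal to itself.
nan_equals_itself = (np.nan == np.nan)

# Hypothetical row contents, mirroring the first row of column D.
values = [1.0, np.nan]

# Keep only the elements that compare equal to themselves, i.e. non-NaN.
kept = [v for v in values if v == v]

print(nan_equals_itself, kept)
```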

Finally, create the error by filtering based on length.

df['F'] =  df['E'].apply(lambda x: x[0] if len(x) == 1 else 'error')

The resulting dataframe looks like this:

    B   C   D   E   F
0   1.0 NaN [1.0, nan]  [1.0]   1.0
1   NaN 4.0 [nan, 4.0]  [4.0]   4.0
2   2.0 NaN [2.0, nan]  [2.0]   2.0
3   NaN 5.0 [nan, 5.0]  [5.0]   5.0
4   6.0 7.0 [6.0, 7.0]  [6.0, 7.0]  error

Of course, you could chain all of this together into a single-line, if not very Pythonic, answer:
a = df.apply(lambda x: list(x), axis=1).apply(lambda x: [i for i in x if i == i]).apply(lambda x: x[0] if len(x) == 1 else 'error')
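
If the chained lambdas are hard to read, the same row-wise idea can be sketched with one named helper. This is a variation, not the code above: combine_row is a hypothetical name, and it uses Series.dropna() instead of the i == i comparison.

```python
import numpy as np
import pandas as pd

def combine_row(row):
    """Return the single non-NaN value in the row, or 'error' otherwise."""
    present = row.dropna()
    return present.iloc[0] if len(present) == 1 else "error"

df = pd.DataFrame({
    "B": [1, np.nan, 2, np.nan, 6],
    "C": [np.nan, 4, np.nan, 5, 7],
})

# Apply the helper to each row of the two source columns.
df["F"] = df[["B", "C"]].apply(combine_row, axis=1)

print(df)
```

As with the lambda version, a row where both columns are NaN also comes out as 'error', because the check is "exactly one value present".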

Answered By: dbouz

Have a look at the function combine_first:

df['C'].combine_first(df['B']).mask(df['B'].notna() & df['C'].notna(), 'error')

Output:

0      1.0
1      4.0
2      2.0
3      5.0
4    error
Name: C, dtype: object
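
A short sketch of how this could be assigned back as column D, assuming the question's table with np.nan as NA. It also shows that the order of the two columns in combine_first only matters on the rows that are overwritten with 'error' anyway. The astype(object) cast is an added step that makes the dtype change explicit.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": list("xyzpt"),
    "B": [1, np.nan, 2, np.nan, 6],
    "C": [np.nan, 4, np.nan, 5, 7],
})

both = df["B"].notna() & df["C"].notna()

# combine_first keeps the caller's values and fills its gaps from the argument.
c_first = df["C"].combine_first(df["B"])  # prefers C where both exist
b_first = df["B"].combine_first(df["C"])  # prefers B where both exist

# The two results differ only where both columns hold a value,
# and those rows are replaced by 'error'.
df["D"] = c_first.astype(object).mask(both, "error")

print(df)
```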
Answered By: Mykola Zotko