Concatenate column values in a pandas DataFrame while ignoring NaNs

Question:

I have a the following pandas table

df:

 EVNT_ID col1 col2 col3 col4
 123454   1    Nan   4    5
 628392   Nan   3   Nan   7
 293899   2    Nan  Nan   6
 127820   9    11    12   19

Now I am trying to concat all the columns except the first column and I want my data frame to look in the following way

new_df:

 EVNT_ID col1 col2 col3 col4 new_col
 123454   1    Nan   4    5   1|4|5
 628392   Nan   3   Nan   7    3|7
 293899   2    Nan  Nan   6    2|6
 127820   9    11    12   19  9|11|12|19

I am using the following code

df['new_column'] = df[~df.EVNT_ID].apply(lambda x: '|'.join(x.dropna().astype(str).values), axis=1)

but it is giving me the following error

ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

I would really appreciate if any one can give me where I am wrong. I’d really appreciate that.

Asked By: NAB0815

||

Answers:

You can do this with filter and agg:

df.filter(like='col').agg(
    lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)

0         1|4|5
1           3|7
2           2|6
3    9|11|12|19
dtype: object

Or,

df.drop('EVNT_ID', 1).agg(
        lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)

0         1|4|5
1           3|7
2           2|6
3    9|11|12|19
dtype: object

If performance is important, you can use a list comprehension:

joined = [
    '|'.join([str(int(x)) for x in r if pd.notna(x)]) 
    for r in df.iloc[:,1:].values.tolist()
]
joined
# ['1|4|5', '3|7', '2|6', '9|11|12|19']

df.assign(new_col=joined)   

   EVNT_ID  col1  col2  col3  col4     new_col
0   123454   1.0   NaN   4.0     5       1|4|5
1   628392   NaN   3.0   NaN     7         3|7
2   293899   2.0   NaN   NaN     6         2|6
3   127820   9.0  11.0  12.0    19  9|11|12|19

If you can forgive the overhead of assignment to a DataFrame, here’s timings for the two fastest solutions here.

df = pd.concat([df] * 1000, ignore_index=True)

# In this post.
%%timeit
[
     '|'.join([str(int(x)) for x in r if pd.notna(x)]) 
     for r in df.iloc[:,1:].values.tolist()
]
# RafaelC's answer.
%%timeit
[
    '|'.join([k for k in a if k])
    for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values.tolist())
]    

31.9 ms ± 800 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
23.7 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Although note the answers aren’t identical because @RafaelC’s code produces floats: ['1.0|2.0|9.0', '3.0|11.0', ...]. If this is fine, then great. Otherwise you’ll need to convert to int which adds more overhead.

Answered By: cs95

Using list comprehension and zip

>>> [['|'.join([k for k in a if k])] for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values)]

Timing seems alright

df = pd.concat([df]*1000)

%timeit [['|'.join([k for k in a if k])] for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values)]
10.8 ms ± 568 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df.filter(like='col').agg(lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
1.68 s ± 91.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df.iloc[:, 1:].apply(lambda x: '|'.join(str(el) for el in x if str(el) != 'nan'), axis=1)
87.8 ms ± 5.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.assign(new_col=['|'.join([str(int(x)) for x in r if ~np.isnan(x)]) for r in df.iloc[:,1:].values])
45.1 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Answered By: rafaelc

Try the following code:

df['new_col'] = df.iloc[:, 1:].apply(lambda x:
    '|'.join(str(el) for el in x if str(el) != 'nan'), axis=1)

Initially I thought about x.dropna() instead of x if str(el) != 'nan',
but %timeit showed that dropna() works much slower.

Answered By: Valdi_Bo
import time
import timeit
from pandas import DataFrame
import numpy as np
import pandas as pd
from datetime import datetime

df = pd.DataFrame({
    'date' : ['05/9/2023', '07/10/2023', '08/11/2023', '06/12/2023'],
    'A' : [1,  np.nan,4, 7],
    'B' : [2, np.nan, 5, 8],
    'C' : [3, 6, 9, np.nan]
}).set_index('date')

print(df)

print('.........')

start_time = datetime.now()
df['ColumnA'] = df[df.columns].agg(
    lambda x: ','.join(x.dropna().astype(str)),
    axis=1
)
print(df['ColumnA'])

end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))


"""
              A    B    C
date                     
05/9/2023   1.0  2.0  3.0
07/10/2023  NaN  NaN  6.0
08/11/2023  4.0  5.0  9.0
06/12/2023  7.0  8.0  NaN
...........................
OUTPUT:

date
05/9/2023     1.0,2.0,3.0
07/10/2023            6.0
08/11/2023    4.0,5.0,9.0
06/12/2023        7.0,8.0
Name: ColumnA, dtype: object
Duration: 0:00:00.002998

"""
Answered By: Soudipta Dutta
Categories: questions Tags: , , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.