pandas combine two strings ignore nan values
Question:
I have two columns with strings. I would like to combine them and ignore NaN
values, such that:
ColA  ColB  ColA+ColB
str   str   strstr
str   NaN   str
NaN   str   str
I tried df['ColA+ColB'] = df['ColA'] + df['ColB'],
but that creates a NaN value if either column is NaN. I've also thought about using concat.
I suppose I could just go with that, and then use something like df.loc[df['ColB'].isna(), 'ColA+ColB'] = df['ColA'],
but that seems like quite the workaround.
Answers:
You could fill the NaN with an empty string:
df['ColA+ColB'] = df['ColA'].fillna('') + df['ColB'].fillna('')
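For reference, here is a minimal, self-contained sketch of that one-liner. The sample values are made up for illustration; only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data using the question's column names.
df = pd.DataFrame({
    "ColA": ["x", "y", np.nan],
    "ColB": ["1", np.nan, "3"],
})

# Plain addition propagates NaN whenever either side is missing.
plain = df["ColA"] + df["ColB"]

# Replacing NaN with an empty string first keeps the non-missing side.
df["ColA+ColB"] = df["ColA"].fillna("") + df["ColB"].fillna("")

print(plain.isna().tolist())     # [False, True, True]
print(df["ColA+ColB"].tolist())  # ['x1', 'y', '3']
```

Note that a row where both columns are NaN ends up as an empty string rather than NaN.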
Call fillna and pass an empty str as the fill value, and then sum with param axis=1:
In [3]:
df = pd.DataFrame({'a':['asd',np.nan,'asdsa'], 'b':['asdas','asdas',np.nan]})
df
Out[3]:
       a      b
0    asd  asdas
1    NaN  asdas
2  asdsa    NaN
In [7]:
df['a+b'] = df.fillna('').sum(axis=1)
df
Out[7]:
       a      b       a+b
0    asd  asdas  asdasdas
1    NaN  asdas     asdas
2  asdsa    NaN     asdsa
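Using apply and str.cat, you can skip the NaN values, because str.cat omits missing values by default when it joins a single Series: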
In [723]: df
Out[723]:
       a      b
0    asd  asdas
1    NaN  asdas
2  asdsa    NaN
In [724]: df['a+b'] = df.apply(lambda x: x.str.cat(sep=''), axis=1)
In [725]: df
Out[725]:
       a      b       a+b
0    asd  asdas  asdasdas
1    NaN  asdas     asdas
2  asdsa    NaN     asdsa
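If the row-wise apply is too slow, a possible alternative is to call str.cat column-wise and pass na_rep='' so that missing values are treated as empty strings. This is only a sketch on the same sample frame, not something taken from the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": ["asd", np.nan, "asdsa"],
    "b": ["asdas", "asdas", np.nan],
})

# Concatenate column 'a' with column 'b' element-wise;
# na_rep='' substitutes an empty string for any missing value.
df["a+b"] = df["a"].str.cat(df["b"], na_rep="")

print(df["a+b"].tolist())  # ['asdasdas', 'asdas', 'asdsa']
```

Without na_rep, str.cat with a second column returns NaN for any row where either value is missing, just like plain addition.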
Prefer adding the columns over using the apply method, because it is much faster than apply. Note that plain addition still propagates NaN, so fill the missing values first.
- Just add the two columns (if you know they are strings):
%timeit df.bio + df.procedure_codes
21.2 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
- Use apply:
%timeit df[eventcol].apply(lambda x: ''.join(x), axis=1)
13.6 s ± 343 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
- Use pandas string methods and cat:
%timeit df[eventcol[0]].str.cat(cols, sep=',')
264 ms ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
- Use sum (which concatenates strings):
%timeit df[eventcol].sum(axis=1)
509 ms ± 6.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
see here for more tests
In my case, I wanted to join more than two columns together with a separator (a+b+c):
In [3]:
df = pd.DataFrame({'a':['asd',np.nan,'asdsa'], 'b':['asdas','asdas',np.nan], 'c':['as',np.nan,'ds']})
In [4]: df
Out[4]:
       a      b    c
0    asd  asdas   as
1    NaN  asdas  NaN
2  asdsa    NaN   ds
The following syntax worked for me:
In [5]: df['d'] = df[['a', 'b', 'c']].fillna('').agg('|'.join, axis=1)
In [6]: df
Out[6]:
       a      b    c             d
0    asd  asdas   as  asd|asdas|as
1    NaN  asdas  NaN       |asdas|
2  asdsa    NaN   ds     asdsa||ds
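As the output shows, filling with empty strings leaves stray separators wherever a value was missing. If that is not what you want, one option is to drop the missing values in each row before joining. The helper name join_non_missing below is made up for this sketch, and because it uses a row-wise apply it will likely be slow on large frames.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": ["asd", np.nan, "asdsa"],
    "b": ["asdas", "asdas", np.nan],
    "c": ["as", np.nan, "ds"],
})

def join_non_missing(row, sep="|"):
    # Hypothetical helper: drop NaN entries from the row,
    # then join whatever strings remain with the separator.
    return sep.join(row.dropna())

df["d"] = df[["a", "b", "c"]].apply(join_non_missing, axis=1)

print(df["d"].tolist())  # ['asd|asdas|as', 'asdas', 'asdsa|ds']
```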