pandas combine two columns with null values
Question:
I have a df with two columns and I want to combine both columns ignoring the NaN values. The catch is that sometimes both columns have NaN values in which case I want the new column to also have NaN. Here’s the example:
df = pd.DataFrame({'foodstuff':['apple-martini', 'apple-pie', None, None, None], 'type':[None, None, 'strawberry-tart', 'dessert', None]})
df
Out[10]:
foodstuff type
0 apple-martini None
1 apple-pie None
2 None strawberry-tart
3 None dessert
4 None None
I tried to use fillna
and solve this :
df['foodstuff'].fillna('') + df['type'].fillna('')
and I got :
0 apple-martini
1 apple-pie
2 strawberry-tart
3 dessert
4
dtype: object
The row 4 has become a blank value. What I want in this situation is a NaN value since both the combining columns are NaNs.
0 apple-martini
1 apple-pie
2 strawberry-tart
3 dessert
4 None
dtype: object
Answers:
You can always fill the empty string in the new column with None
import numpy as np
df['new_col'].replace(r'^s*$', np.nan, regex=True, inplace=True)
Complete code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'foodstuff':['apple-martini', 'apple-pie', None, None, None], 'type':[None, None, 'strawberry-tart', 'dessert', None]})
df['new_col'] = df['foodstuff'].fillna('') + df['type'].fillna('')
df['new_col'].replace(r'^s*$', np.nan, regex=True, inplace=True)
df
output:
foodstuff type new_col
0 apple-martini None apple-martini
1 apple-pie None apple-pie
2 None strawberry-tart strawberry-tart
3 None dessert dessert
4 None None NaN
Use fillna
on one column with the fill values being the other column:
df['foodstuff'].fillna(df['type'])
The resulting output:
0 apple-martini
1 apple-pie
2 strawberry-tart
3 dessert
4 None
fillna
both columns together
sum(1)
to add them
replace('', np.nan)
df.fillna('').sum(1).replace('', np.nan)
0 apple-martini
1 apple-pie
2 strawberry-tart
3 dessert
4 NaN
dtype: object
you can use the combine
method with a lambda
:
df['foodstuff'].combine(df['type'], lambda a, b: ((a or "") + (b or "")) or None, None)
(a or "")
returns ""
if a is None
then the same logic is applied on the concatenation (where the result would be None
if the concatenation is an empty string).
You can replace the non zero values with column names like:
df1= df.replace(1, pd.Series(df.columns, df.columns))
Afterwards, replace 0’s with empty string and then merge the columns like below:
f = f.replace(0, '')
f['new'] = f.First+f.Second+f.Three+f.Four
Refer the full code below:
import pandas as pd
df = pd.DataFrame({'Second':[0,1,0,0],'First':[1,0,0,0],'Three':[0,0,1,0],'Four':[0,0,0,1], 'cl': ['3D', 'Wireless','Accounting','cisco']})
df2=pd.DataFrame({'pi':['Accounting','cisco','3D','Wireless']})
df1= df.replace(1, pd.Series(df.columns, df.columns))
f = pd.merge(df1,df2,how='right',left_on=['cl'],right_on=['pi'])
f = f.replace(0, '')
f['new'] = f.First+f.Second+f.Three+f.Four
df1
:
In [3]: df1
Out[3]:
Second First Three Four cl
0 0 First 0 0 3D
1 Second 0 0 0 Wireless
2 0 0 Three 0 Accounting
3 0 0 0 Four cisco
df2
:
In [4]: df2
Out[4]:
pi
0 Accounting
1 cisco
2 3D
3 Wireless
Final DataFrame f
will be:
In [2]: f
Out[2]:
Second First Three Four cl pi new
0 First 3D 3D First
1 Second Wireless Wireless Second
2 Three Accounting Accounting Three
3 Four cisco cisco Four
We can make this problem even more complete and have a universal solution for this type of problem.
The key things in there are that we wish to join a group of columns together but just ignore NaN
s.
Here is my answer:
df = pd.DataFrame({'foodstuff':['apple-martini', 'apple-pie', None, None, None],
'type':[None, None, 'strawberry-tart', 'dessert', None],
'type1':[98324, None, None, 'banan', None],
'type2':[3, None, 'strawberry-tart', np.nan, None]})
df=df.fillna("NAN")
df=df.astype('str')
df["output"] = df[['foodstuff', 'type', 'type1', 'type2']].agg(', '.join, axis=1)
df['output'] = df['output'].str.replace('NAN, ', '')
df['output'] = df['output'].str.replace(', NAN', '')
If you deal with columns that contain something where the others don’t and vice-versa, a one-liner that does well the job is
>>> df.rename(columns={'type': 'foodstuff'}).stack().unstack()
foodstuff
0 apple-martini
1 apple-pie
2 strawberry-tart
3 dessert
… which solution also generalises well if you have multiple columns to "intricate", as long as you can define your ~.rename
mapping. The intention behind such renaming is to create duplicates that ~.stack().unstack()
will then process for you.
As explained, this solution only suits configuration with orthogonal columns, i.e. columns that never are simultaneously valued.
With combine_first
you can fill null values in one column with non-null values from another column:
In [3]: df['foodstuff'].combine_first(df['type'])
Out[3]:
0 apple-martini
1 apple-pie
2 strawberry-tart
3 dessert
4 None
I have a df with two columns and I want to combine both columns ignoring the NaN values. The catch is that sometimes both columns have NaN values in which case I want the new column to also have NaN. Here’s the example:
df = pd.DataFrame({'foodstuff':['apple-martini', 'apple-pie', None, None, None], 'type':[None, None, 'strawberry-tart', 'dessert', None]})
df
Out[10]:
foodstuff type
0 apple-martini None
1 apple-pie None
2 None strawberry-tart
3 None dessert
4 None None
I tried to use fillna
and solve this :
df['foodstuff'].fillna('') + df['type'].fillna('')
and I got :
0 apple-martini
1 apple-pie
2 strawberry-tart
3 dessert
4
dtype: object
The row 4 has become a blank value. What I want in this situation is a NaN value since both the combining columns are NaNs.
0 apple-martini
1 apple-pie
2 strawberry-tart
3 dessert
4 None
dtype: object
You can always fill the empty string in the new column with None
import numpy as np
df['new_col'].replace(r'^s*$', np.nan, regex=True, inplace=True)
Complete code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'foodstuff':['apple-martini', 'apple-pie', None, None, None], 'type':[None, None, 'strawberry-tart', 'dessert', None]})
df['new_col'] = df['foodstuff'].fillna('') + df['type'].fillna('')
df['new_col'].replace(r'^s*$', np.nan, regex=True, inplace=True)
df
output:
foodstuff type new_col
0 apple-martini None apple-martini
1 apple-pie None apple-pie
2 None strawberry-tart strawberry-tart
3 None dessert dessert
4 None None NaN
Use fillna
on one column with the fill values being the other column:
df['foodstuff'].fillna(df['type'])
The resulting output:
0 apple-martini
1 apple-pie
2 strawberry-tart
3 dessert
4 None
fillna
both columns togethersum(1)
to add themreplace('', np.nan)
df.fillna('').sum(1).replace('', np.nan)
0 apple-martini
1 apple-pie
2 strawberry-tart
3 dessert
4 NaN
dtype: object
you can use the combine
method with a lambda
:
df['foodstuff'].combine(df['type'], lambda a, b: ((a or "") + (b or "")) or None, None)
(a or "")
returns ""
if a is None
then the same logic is applied on the concatenation (where the result would be None
if the concatenation is an empty string).
You can replace the non zero values with column names like:
df1= df.replace(1, pd.Series(df.columns, df.columns))
Afterwards, replace 0’s with empty string and then merge the columns like below:
f = f.replace(0, '')
f['new'] = f.First+f.Second+f.Three+f.Four
Refer the full code below:
import pandas as pd
df = pd.DataFrame({'Second':[0,1,0,0],'First':[1,0,0,0],'Three':[0,0,1,0],'Four':[0,0,0,1], 'cl': ['3D', 'Wireless','Accounting','cisco']})
df2=pd.DataFrame({'pi':['Accounting','cisco','3D','Wireless']})
df1= df.replace(1, pd.Series(df.columns, df.columns))
f = pd.merge(df1,df2,how='right',left_on=['cl'],right_on=['pi'])
f = f.replace(0, '')
f['new'] = f.First+f.Second+f.Three+f.Four
df1
:
In [3]: df1
Out[3]:
Second First Three Four cl
0 0 First 0 0 3D
1 Second 0 0 0 Wireless
2 0 0 Three 0 Accounting
3 0 0 0 Four cisco
df2
:
In [4]: df2
Out[4]:
pi
0 Accounting
1 cisco
2 3D
3 Wireless
Final DataFrame f
will be:
In [2]: f
Out[2]:
Second First Three Four cl pi new
0 First 3D 3D First
1 Second Wireless Wireless Second
2 Three Accounting Accounting Three
3 Four cisco cisco Four
We can make this problem even more complete and have a universal solution for this type of problem.
The key things in there are that we wish to join a group of columns together but just ignore NaN
s.
Here is my answer:
df = pd.DataFrame({'foodstuff':['apple-martini', 'apple-pie', None, None, None],
'type':[None, None, 'strawberry-tart', 'dessert', None],
'type1':[98324, None, None, 'banan', None],
'type2':[3, None, 'strawberry-tart', np.nan, None]})
df=df.fillna("NAN")
df=df.astype('str')
df["output"] = df[['foodstuff', 'type', 'type1', 'type2']].agg(', '.join, axis=1)
df['output'] = df['output'].str.replace('NAN, ', '')
df['output'] = df['output'].str.replace(', NAN', '')
If you deal with columns that contain something where the others don’t and vice-versa, a one-liner that does well the job is
>>> df.rename(columns={'type': 'foodstuff'}).stack().unstack()
foodstuff
0 apple-martini
1 apple-pie
2 strawberry-tart
3 dessert
… which solution also generalises well if you have multiple columns to "intricate", as long as you can define your ~.rename
mapping. The intention behind such renaming is to create duplicates that ~.stack().unstack()
will then process for you.
As explained, this solution only suits configuration with orthogonal columns, i.e. columns that never are simultaneously valued.
With combine_first
you can fill null values in one column with non-null values from another column:
In [3]: df['foodstuff'].combine_first(df['type'])
Out[3]:
0 apple-martini
1 apple-pie
2 strawberry-tart
3 dessert
4 None