Pandas fill missing values in dataframe from another dataframe
Question:
I cannot find a pandas function (which I had seen before) to substitute the NaN’s in a dataframe with values from another dataframe (assuming a common index which can be specified). Any help?
Answers:
If you have two DataFrames of the same shape, then:
df[df.isnull()] = d2
Will do the trick.
Only locations where df.isnull() evaluates to True will be eligible for assignment.
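To make the boolean-mask assignment concrete, here is a minimal sketch. The sample values are made up for illustration; it assumes df and d2 share the same shape, index, and column labels.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: df has gaps, d2 has the same shape and labels.
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 5.0, np.nan]})
d2 = pd.DataFrame({"A": [10.0, 20.0, 30.0], "B": [40.0, 50.0, 60.0]})

# Assign only where df is NaN; existing values are left untouched.
df[df.isnull()] = d2

print(df)
```

Only the three NaN cells pick up values from d2; the cells that already held 1.0, 3.0, and 5.0 keep them.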
In practice, the DataFrames aren’t always the same size / shape, and transforming methods (especially .shift()) are useful.
Data coming in is invariably dirty, incomplete, or inconsistent. Par for the course. There’s a pretty extensive pandas tutorial and associated cookbook for dealing with these situations.
As I just learned, there is a DataFrame.combine_first() method, which does precisely this, with the additional property that if your updating data frame d2 is bigger than your original df, the additional rows and columns are added as well.
df = df.combine_first(d2)
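A small sketch of that behaviour, using made-up data in which d2 has one extra row and one extra column compared with df:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: d2 is "bigger" than df (extra row "z", extra column "B").
df = pd.DataFrame({"A": [1.0, np.nan]}, index=["x", "y"])
d2 = pd.DataFrame(
    {"A": [10.0, 20.0, 30.0], "B": [7.0, 8.0, 9.0]},
    index=["x", "y", "z"],
)

# Values from df win where present; gaps and missing labels come from d2.
result = df.combine_first(d2)

print(result)
```

Here the existing value at (x, A) is kept, the NaN at (y, A) is filled from d2, and row z and column B are added from d2.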
DataFrame.combine_first() answers this question exactly.
However, sometimes you want to fill/replace/overwrite some of the non-missing (non-NaN) values of DataFrame A with values from DataFrame B. That question brought me to this page, and the solution is DataFrame.mask():
A = B.mask(condition, A)
When condition is true, the values from A will be used, otherwise B’s values will be used.
For example, you could solve the OP’s original question with mask such that when an element from A is non-NaN, use it, otherwise use the corresponding element from B.
But using DataFrame.mask() you could replace the values of A that fail to meet arbitrary criteria (less than zero? more than 100?) with values from B. So mask is more flexible, and overkill for this problem, but I thought it was worthy of mention (I needed it to solve my problem).
It’s also important to note that the replacement values could come from a numpy array instead of a DataFrame. DataFrame.combine_first() requires that its argument be a DataFrame, but DataFrame.mask() just requires that the replacement values it is given have dimensions matching those of the DataFrame it is called on.
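Both uses of mask described above can be sketched as follows. The frames A and B and their values are invented for illustration, and they are assumed to share the same index and columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with identical labels.
A = pd.DataFrame({"x": [1.0, np.nan, -5.0], "y": [np.nan, 2.0, 3.0]})
B = pd.DataFrame({"x": [10.0, 20.0, 30.0], "y": [40.0, 50.0, 60.0]})

# Fill only A's NaNs: where A is non-NaN use A, otherwise keep B's value.
filled = B.mask(A.notna(), A)

# Arbitrary criterion: replace the negative values of A with values from B.
# NaN < 0 evaluates to False, so NaNs in A are left as they are here.
no_negatives = A.mask(A < 0, B)

print(filled)
print(no_negatives)
```

The first call reproduces the NaN-filling behaviour; the second shows the more general case where the condition has nothing to do with missing values.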
This should be as simple as
df = df.fillna(d2)
(fillna returns a new DataFrame rather than modifying df, so assign the result back.)
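As a rough illustration of how fillna lines up the two frames by label rather than by position, consider this sketch with made-up data, where d2 covers only some of the rows and lists them in a different order:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: d2 has a subset of df's index, in another order.
df = pd.DataFrame({"A": [np.nan, 2.0, np.nan]}, index=["a", "b", "c"])
d2 = pd.DataFrame({"A": [300.0, 100.0]}, index=["c", "a"])

# NaNs in df are filled from the cell of d2 with the same row and column label.
filled = df.fillna(d2)

print(filled)
print(df)  # the original frame still contains its NaNs
```

Row a receives 100.0 and row c receives 300.0 despite the different ordering, and df itself is unchanged because the result is returned as a new object.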
A dedicated method for this is DataFrame.update:
Quoted from the documentation:
Modify in place using non-NA values from another DataFrame.
Aligns on indices. There is no return value.
Note that this method modifies your data in place, so the DataFrame you call it on is overwritten.
Example:
print(df1)
       A    B     C
aaa  NaN  1.0   NaN
bbb  NaN  NaN  10.0
ccc  3.0  NaN   6.0
ddd  NaN  NaN   NaN
eee  NaN  NaN   NaN
print(df2)
         A    B     C
index
aaa    1.0  1.0   NaN
bbb    NaN  NaN  10.0
eee    NaN  1.0   NaN
# update df1 NaN where there are values in df2
df1.update(df2)
print(df1)
       A    B     C
aaa  1.0  1.0   NaN
bbb  NaN  NaN  10.0
ccc  3.0  NaN   6.0
ddd  NaN  NaN   NaN
eee  NaN  1.0   NaN
Notice the updated NaN values at the intersections (aaa, A) and (eee, B).
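For anyone who wants to run this, here is a sketch that rebuilds the two frames from the printed output above (the construction code is my reconstruction, not the original poster's). One caveat: by default update also overwrites existing non-NaN values in df1 with non-NA values from df2. Passing overwrite=False, as this sketch does, restricts it to filling the NaN cells only, which is closer to what the question asks for.

```python
import numpy as np
import pandas as pd

# Reconstructed from the printed frames above.
df1 = pd.DataFrame(
    {
        "A": [np.nan, np.nan, 3.0, np.nan, np.nan],
        "B": [1.0, np.nan, np.nan, np.nan, np.nan],
        "C": [np.nan, 10.0, 6.0, np.nan, np.nan],
    },
    index=["aaa", "bbb", "ccc", "ddd", "eee"],
)
df2 = pd.DataFrame(
    {
        "A": [1.0, np.nan, np.nan],
        "B": [1.0, np.nan, 1.0],
        "C": [np.nan, 10.0, np.nan],
    },
    index=pd.Index(["aaa", "bbb", "eee"], name="index"),
)

# Modifies df1 in place and returns None.
# overwrite=False: only cells that are NaN in df1 are updated.
df1.update(df2, overwrite=False)

print(df1)
```

With this particular data the result is the same as with the default overwrite=True, because df2 never disagrees with a non-NaN value in df1.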
One important piece of information missing from the other answers is that both combine_first and fillna match on index, so the indices must match across the DataFrames for these methods to work.
Oftentimes, there’s a need to match on some other column(s) to fill in missing values. In that case, you need to use set_index first to make the columns to be matched the index.
df1 = df1.set_index(cols_to_be_matched).fillna(df2.set_index(cols_to_be_matched)).reset_index()
or
df1 = df1.set_index(cols_to_be_matched).combine_first(df2.set_index(cols_to_be_matched)).reset_index()
Another option is to use merge:
df1 = (df1.merge(df2, on=cols_to_be_matched, how='left', suffixes=('', '\x00'))
          .sort_index(axis=1).bfill(axis=1)[df1.columns])
The idea here is to left-merge and, by sorting the columns (we use '\x00' as the suffix for columns from df2 since it’s the character with the lowest Unicode value), make sure the same column values end up next to each other. Then use bfill horizontally to update df1 with values from df2.
Example:
Suppose you had df1:
   C1 C2   C3  C4
0   1  a  1.0   0
1   1  b  NaN   1
2   2  b  NaN   2
3   2  b  NaN   3
and df2:
   C1 C2  C3
0   1  b   2
1   2  b   3
and you want to fill in the missing values in df1 with values in df2 for each C1–C2 value pair. Then
cols_to_be_matched = ['C1', 'C2']
and all of the code snippets above produce the following output (where the values are indeed filled as required):
   C1 C2   C3  C4
0   1  a  1.0   0
1   1  b  2.0   1
2   2  b  3.0   2
3   2  b  3.0   3
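A runnable sketch of the set_index / fillna variant on this example follows. The frame construction is my reconstruction from the printed tables, and the result name filled is just an illustrative choice.

```python
import numpy as np
import pandas as pd

# Reconstructed from the example tables above.
df1 = pd.DataFrame(
    {
        "C1": [1, 1, 2, 2],
        "C2": ["a", "b", "b", "b"],
        "C3": [1.0, np.nan, np.nan, np.nan],
        "C4": [0, 1, 2, 3],
    }
)
df2 = pd.DataFrame({"C1": [1, 2], "C2": ["b", "b"], "C3": [2, 3]})

cols_to_be_matched = ["C1", "C2"]

# Move the key columns into the index so fillna aligns on them,
# fill the gaps from df2, then restore the keys as ordinary columns.
filled = (
    df1.set_index(cols_to_be_matched)
    .fillna(df2.set_index(cols_to_be_matched))
    .reset_index()
)

print(filled)
```

Note that the key pair (2, b) appears twice in df1, and both rows receive the same value from df2, while the (1, a) row keeps its original C3.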