join or merge with overwrite in pandas
Question:
I want to perform a join/merge/append operation on a dataframe with datetime index.
Let’s say I have df1 and I want to add df2 to it. df2 can have fewer or more columns, and overlapping indexes. For all rows where the indexes match, if df2 has the same column as df1, I want the values of df1 to be overwritten with those from df2.
How can I obtain the desired result?
Answers:
How about: df2.combine_first(df1)?
In [33]: df2
Out[33]:
A B C D
2000-01-03 0.638998 1.277361 0.193649 0.345063
2000-01-04 -0.816756 -1.711666 -1.155077 -0.678726
2000-01-05 0.435507 -0.025162 -1.112890 0.324111
2000-01-06 -0.210756 -1.027164 0.036664 0.884715
2000-01-07 -0.821631 -0.700394 -0.706505 1.193341
2000-01-10 1.015447 -0.909930 0.027548 0.258471
2000-01-11 -0.497239 -0.979071 -0.461560 0.447598
In [34]: df1
Out[34]:
A B C
2000-01-03 2.288863 0.188175 -0.040928
2000-01-04 0.159107 -0.666861 -0.551628
2000-01-05 -0.356838 -0.231036 -1.211446
2000-01-06 -0.866475 1.113018 -0.001483
2000-01-07 0.303269 0.021034 0.471715
2000-01-10 1.149815 0.686696 -1.230991
2000-01-11 -1.296118 -0.172950 -0.603887
2000-01-12 -1.034574 -0.523238 0.626968
2000-01-13 -0.193280 1.857499 -0.046383
2000-01-14 -1.043492 -0.820525 0.868685
In [35]: df2.comb
df2.combine df2.combineAdd df2.combine_first df2.combineMult
In [35]: df2.combine_first(df1)
Out[35]:
A B C D
2000-01-03 0.638998 1.277361 0.193649 0.345063
2000-01-04 -0.816756 -1.711666 -1.155077 -0.678726
2000-01-05 0.435507 -0.025162 -1.112890 0.324111
2000-01-06 -0.210756 -1.027164 0.036664 0.884715
2000-01-07 -0.821631 -0.700394 -0.706505 1.193341
2000-01-10 1.015447 -0.909930 0.027548 0.258471
2000-01-11 -0.497239 -0.979071 -0.461560 0.447598
2000-01-12 -1.034574 -0.523238 0.626968 NaN
2000-01-13 -0.193280 1.857499 -0.046383 NaN
2000-01-14 -1.043492 -0.820525 0.868685 NaN
Note that it takes the values from df1 for indices that do not overlap with df2. If this doesn’t do exactly what you want I would be willing to improve this function / add options to it.
For a merge like this, the update method of a DataFrame is useful.
Taking the examples from the documentation:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[np.nan, 3., 5.], [-4.6, 2.1, np.nan],
                    [np.nan, 7., np.nan]])
df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5., 1.6, 4]],
                   index=[1, 2])
Data before the update:
>>> df1
0 1 2
0 NaN 3.0 5.0
1 -4.6 2.1 NaN
2 NaN 7.0 NaN
>>>
>>> df2
0 1 2
1 -42.6 NaN -8.2
2 -5.0 1.6 4.0
Let’s update df1 with data from df2:
df1.update(df2)
Data after the update:
>>> df1
0 1 2
0 NaN 3.0 5.0
1 -42.6 2.1 -8.2
2 -5.0 1.6 4.0
Remarks:
- It’s important to notice that this is an “in place” operation, modifying the DataFrame that calls update.
- Also note that non-NaN values in df1 are not overwritten with NaN values in df2.
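The remarks above can be checked with a small sketch. The frames below are made up for illustration (they are not the documentation example), and they add one more point that matters for the original question: update only touches labels that already exist in the calling DataFrame, so extra rows or columns in df2 are not added.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: df2 has an extra row (index 3) and an extra column (D).
df1 = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]}, index=[0, 1, 2])
df2 = pd.DataFrame({"A": [np.nan, -3.0, -4.0], "D": [0.1, 0.2, 0.3]}, index=[1, 2, 3])

# update modifies df1 in place and returns None.
ret = df1.update(df2)

print(ret)  # None
print(df1)  # A at index 1 keeps 2.0 (df2 has NaN there); A at index 2 becomes -3.0
```

If the new rows and columns from df2 are needed as well, combine_first is probably the closer fit, since it returns a frame over the union of both indexes and column sets.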