Python Pandas – Merging two Dataframes with new and old rows

Question:

I have two DataFrames whose rows share a corresponding index, and I want to merge them. Every row has an update time. For rows with identical indexes, the row with the higher update time wins: all fields should be taken from the ‘newer’ row, except for fields that have a value only in the ‘older’ row.
Example:

import pandas as pd

df1 = pd.DataFrame({'Hugo' : {'age' : 21, 'weight' : 75},
                   'Niklas': {'age' : 46, 'weight' : 65},
                   'Ronald' : {'age' : 76, 'weight' : 85, 'height' : 176}}).T
df1.index.names = ['name']
df1['update_time'] = 1

df2 = pd.DataFrame({'Hugo' : {'age' : 22, 'weight' : 77},
                   'Bertram': {'age' : 45, 'weight' : 65, 'height' : 190},
                   'Donald' : {'age' : 75, 'weight' : 85},
                   'Ronald' : {'age' : 77, 'weight' : 84}}).T
df2.index.names = ['name']
df2['update_time'] = 2


df1:
+--------+-------+----------+----------+---------------+
| name   |   age |   height |   weight |   update_time |
|--------+-------+----------+----------+---------------|
| Hugo   |    21 |      nan |       75 |             1 |
| Niklas |    46 |      nan |       65 |             1 |
| Ronald |    76 |      176 |       85 |             1 |
+--------+-------+----------+----------+---------------+
df2:
+---------+-------+----------+----------+---------------+
| name    |   age |   height |   weight |   update_time |
|---------+-------+----------+----------+---------------|
| Bertram |    45 |      190 |       65 |             2 |
| Donald  |    75 |      nan |       85 |             2 |
| Hugo    |    22 |      nan |       77 |             2 |
| Ronald  |    77 |      nan |       84 |             2 |
+---------+-------+----------+----------+---------------+

Result should look like this:

+---------+-------+----------+----------+---------------+
| name    |   age |   height |   weight |   update_time |
|---------+-------+----------+----------+---------------|
| Niklas  |    46 |      nan |       65 |             1 |
| Bertram |    45 |      190 |       65 |             2 |
| Donald  |    75 |      nan |       85 |             2 |
| Hugo    |    22 |      nan |       77 |             2 |
| Ronald  |    77 |      176 |       84 |             2 |
+---------+-------+----------+----------+---------------+

How can I do this? The problem is keeping Ronald’s height value.
If I first call df.update on df1, the timestamp is no longer there and I cannot find the older duplicates.
If I use df.append, I can’t merge the fields.

Asked By: Egirus Ornila


Answers:

Use combine_first:

df2.combine_first(df1)

output:

          age  height  weight  update_time
name                                      
Bertram  45.0   190.0    65.0          2.0
Donald   75.0     NaN    85.0          2.0
Hugo     22.0     NaN    77.0          2.0
Niklas   46.0     NaN    65.0          1.0
Ronald   77.0   176.0    84.0          2.0
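
The call above can be reproduced end to end with the data from the question. This is only a sketch: the variable name merged is an assumption, and note that combine_first does not look at update_time at all. The frame it is called on (df2 here) takes priority, so the caller has to know which frame is the newer one.

```python
import pandas as pd

df1 = pd.DataFrame({'Hugo': {'age': 21, 'weight': 75},
                    'Niklas': {'age': 46, 'weight': 65},
                    'Ronald': {'age': 76, 'weight': 85, 'height': 176}}).T
df1.index.names = ['name']
df1['update_time'] = 1

df2 = pd.DataFrame({'Hugo': {'age': 22, 'weight': 77},
                    'Bertram': {'age': 45, 'weight': 65, 'height': 190},
                    'Donald': {'age': 75, 'weight': 85},
                    'Ronald': {'age': 77, 'weight': 84}}).T
df2.index.names = ['name']
df2['update_time'] = 2

# Non-null values of df2 win; df1 only fills the gaps, i.e. cells that are
# NaN in df2 and rows that do not exist in df2 at all.
merged = df2.combine_first(df1)
print(merged)
```

Ronald keeps the height from df1 because that cell is NaN in df2, while his age and weight come from df2.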
Answered By: Scott Boston

I am adding my answer for completeness, as another working solution that I have mainly used:

rdpz = lambda df: df.groupby(df.index).last()
df3 = rdpz( pd.concat([df1, df2]) )

Here, the function rdpz groups the concatenated dataframes by their index and takes the last non-null entry of each column within every group (first can be substituted for last if the list order is reversed).
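
To illustrate why this keeps Ronald's height, here is a small sketch with a two-row stand-in for the concatenated data (the names stacked, per_column and whole_row are made up for the example). It contrasts groupby(...).last() with simply keeping the last duplicate row.

```python
import pandas as pd

# Older row first, newer row second, as pd.concat([df1, df2]) would stack them.
idx = pd.Index(['Ronald', 'Ronald'], name='name')
stacked = pd.DataFrame({'age': [76.0, 77.0],
                        'height': [176.0, float('nan')]}, index=idx)

# last() works column by column and ignores NaN, so the older height survives.
per_column = stacked.groupby(level=0).last()

# Keeping only the last duplicate row takes the newer row as a whole,
# which loses the height that exists only in the older row.
whole_row = stacked[~stacked.index.duplicated(keep='last')]
```

In per_column the age is 77 and the height is 176, whereas whole_row has the age 77 but a missing height.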

This is not nearly as clean as Scott Boston's answer, but it is helpful if you want to extend to multiple dataframes or have trickier multiindexes. In one line, you can use:

df3 = pd.concat([df1, df2]).pipe(lambda df: df.groupby(df.index).last())
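
The extension to several dataframes could look roughly like the sketch below. The helper merge_latest and its time_col parameter are hypothetical names, not part of pandas. It assumes every frame has an update_time column and sorts on it first, so the result does not depend on the order in which the frames are passed.

```python
import pandas as pd

def merge_latest(frames, time_col='update_time'):
    """Hypothetical helper: newest non-null value per index label and column."""
    stacked = pd.concat(frames)
    # Stable sort so that rows run from oldest to newest update time.
    stacked = stacked.sort_values(time_col, kind='mergesort')
    # last() picks the last non-null entry of each column in every group.
    return stacked.groupby(level=0).last()

old = pd.DataFrame({'age': [76.0], 'height': [176.0], 'update_time': [1]},
                   index=pd.Index(['Ronald'], name='name'))
new = pd.DataFrame({'age': [77.0], 'height': [float('nan')], 'update_time': [2]},
                   index=pd.Index(['Ronald'], name='name'))

# The newer frame is passed first on purpose; sorting restores the time order.
result = merge_latest([new, old])
```

Grouping on level=0 assumes a single-level index as in the question; for a MultiIndex the grouping key would need to be adapted.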
Answered By: double0darbo