pandas DataFrame concat / update ("upsert")?
Question:
I am looking for an elegant way to append all the rows from one DataFrame to another DataFrame (both DataFrames having the same index and column structure), but in cases where the same index value appears in both DataFrames, use the row from the second data frame.
So, for example, if I start with:
df1:
A B
date
'2015-10-01' 'A1' 'B1'
'2015-10-02' 'A2' 'B2'
'2015-10-03' 'A3' 'B3'
df2:
date A B
'2015-10-02' 'a1' 'b1'
'2015-10-03' 'a2' 'b2'
'2015-10-04' 'a3' 'b3'
I would like the result to be:
A B
date
'2015-10-01' 'A1' 'B1'
'2015-10-02' 'a1' 'b1'
'2015-10-03' 'a2' 'b2'
'2015-10-04' 'a3' 'b3'
This is analogous to what I think is called “upsert” in some SQL systems — a combination of update and insert, in the sense that each row from df2
is either (a) used to update an existing row in df1
if the row key already exists in df1
, or (b) inserted into df1
at the end if the row key does not already exist.
I have come up with the following
pd.concat([df1, df2]) # concat the two DataFrames
.reset_index() # turn 'date' into a regular column
.groupby('date') # group rows by values in the 'date' column
.tail(1) # take the last row in each group
.set_index('date') # restore 'date' as the index
which seems to work, but this relies on the order of the rows in each groupby group always being the same as the original DataFrames, which I haven’t checked on, and seems displeasingly convoluted.
Does anyone have any ideas for a more straightforward solution?
Answers:
One solution is to conatenate df1
with new rows in df2
(i.e. where the index does not match). Then update the values with those from df2
.
df = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
df.update(df2)
>>> df
A B
2015-10-01 A1 B1
2015-10-02 a1 b1
2015-10-03 a2 b2
2015-10-04 a3 b3
EDIT:
Per the suggestion of @chrisb, this can further be simplified as follows:
pd.concat([df1[~df1.index.isin(df2.index)], df2])
Thanks Chris!
In addition to the correct answer, watch out for columns that do not exist in both dataframes:
df1 = pd.DataFrame([['test',1, True], ['test2',2, True]]).set_index(0)
df2 = pd.DataFrame([['test2',4], ['test3',3]]).set_index(0)
If you just use the aforementioned solution as-is, you get:
>>> 1 2
0
test 1 True
test2 4 NaN
test3 3 NaN
But if you are expecting the following output:
>>> 1 2
0
test 1 True
test2 4 True
test3 3 NaN
Just change the statement to:
df1 = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
df1.update(df2)
As of pandas 1.0.3
, the desired functionality is directly given by combine_first
:
combined = df2.combine_first(df1)
print(combined)
# A B
# 2015-10-01 A1 B1
# 2015-10-02 a1 b1
# 2015-10-03 a2 b2
# 2015-10-04 a3 b3
To get this behaviour, the dataframe whose data has priority (the updating one, in this case df2
) must be the one calling the function.
It basically: (1) harmonizes rows and columns, (2) gives priority to non-NaN data, and (3) if datapoints defined in both dataframes, gives priority to data in df2
, which is essentially what you want.
EDIT: My understanding is that combine_first
does fulfill the requested "update-if-present-insert-if-absent" behaviour. However, according to Vijchti in the comments (thanks), this does not strictly correspond to an SQL UPSERT manipulation, as the logic is applied value-by-value instead of the entire row. I removed any reference to UPSERT from my answer.
I am looking for an elegant way to append all the rows from one DataFrame to another DataFrame (both DataFrames having the same index and column structure), but in cases where the same index value appears in both DataFrames, use the row from the second data frame.
So, for example, if I start with:
df1:
A B
date
'2015-10-01' 'A1' 'B1'
'2015-10-02' 'A2' 'B2'
'2015-10-03' 'A3' 'B3'
df2:
date A B
'2015-10-02' 'a1' 'b1'
'2015-10-03' 'a2' 'b2'
'2015-10-04' 'a3' 'b3'
I would like the result to be:
A B
date
'2015-10-01' 'A1' 'B1'
'2015-10-02' 'a1' 'b1'
'2015-10-03' 'a2' 'b2'
'2015-10-04' 'a3' 'b3'
This is analogous to what I think is called “upsert” in some SQL systems — a combination of update and insert, in the sense that each row from df2
is either (a) used to update an existing row in df1
if the row key already exists in df1
, or (b) inserted into df1
at the end if the row key does not already exist.
I have come up with the following
pd.concat([df1, df2]) # concat the two DataFrames
.reset_index() # turn 'date' into a regular column
.groupby('date') # group rows by values in the 'date' column
.tail(1) # take the last row in each group
.set_index('date') # restore 'date' as the index
which seems to work, but this relies on the order of the rows in each groupby group always being the same as the original DataFrames, which I haven’t checked on, and seems displeasingly convoluted.
Does anyone have any ideas for a more straightforward solution?
One solution is to conatenate df1
with new rows in df2
(i.e. where the index does not match). Then update the values with those from df2
.
df = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
df.update(df2)
>>> df
A B
2015-10-01 A1 B1
2015-10-02 a1 b1
2015-10-03 a2 b2
2015-10-04 a3 b3
EDIT:
Per the suggestion of @chrisb, this can further be simplified as follows:
pd.concat([df1[~df1.index.isin(df2.index)], df2])
Thanks Chris!
In addition to the correct answer, watch out for columns that do not exist in both dataframes:
df1 = pd.DataFrame([['test',1, True], ['test2',2, True]]).set_index(0)
df2 = pd.DataFrame([['test2',4], ['test3',3]]).set_index(0)
If you just use the aforementioned solution as-is, you get:
>>> 1 2
0
test 1 True
test2 4 NaN
test3 3 NaN
But if you are expecting the following output:
>>> 1 2
0
test 1 True
test2 4 True
test3 3 NaN
Just change the statement to:
df1 = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
df1.update(df2)
As of pandas 1.0.3
, the desired functionality is directly given by combine_first
:
combined = df2.combine_first(df1)
print(combined)
# A B
# 2015-10-01 A1 B1
# 2015-10-02 a1 b1
# 2015-10-03 a2 b2
# 2015-10-04 a3 b3
To get this behaviour, the dataframe whose data has priority (the updating one, in this case df2
) must be the one calling the function.
It basically: (1) harmonizes rows and columns, (2) gives priority to non-NaN data, and (3) if datapoints defined in both dataframes, gives priority to data in df2
, which is essentially what you want.
EDIT: My understanding is that combine_first
does fulfill the requested "update-if-present-insert-if-absent" behaviour. However, according to Vijchti in the comments (thanks), this does not strictly correspond to an SQL UPSERT manipulation, as the logic is applied value-by-value instead of the entire row. I removed any reference to UPSERT from my answer.