pandas left join and update existing column
Question:
I am new to pandas and can’t seem to get this to work with merge function:
>>> left >>> right
a b c a c d
0 1 4 9 0 1 7 13
1 2 5 10 1 2 8 14
2 3 6 11 2 3 9 15
3 4 7 12
With a left join on column a, I would like to update common columns BY THE JOINED KEYS. Note last value in column c is from LEFT table since there is no match.
>>> final
a b c d
0 1 4 7 13
1 2 5 8 14
2 3 6 9 15
3 4 7 12 NAN
How should I do this with Pandas merge function? Thank you.
Answers:
You can use merge()
between left
and right
with how='left'
on 'a'
column.
In [74]: final = left.merge(right, on='a', how='left')
In [75]: final
Out[75]:
a b c_x c_y d
0 1 4 9 7 13
1 2 5 10 8 14
2 3 6 11 9 15
3 4 7 12 NaN NaN
Replace NaN
value from c_y
with c_x
value
In [76]: final['c'] = final['c_y'].fillna(final['c_x'])
In [77]: final
Out[77]:
a b c_x c_y d c
0 1 4 9 7 13 7
1 2 5 10 8 14 8
2 3 6 11 9 15 9
3 4 7 12 NaN NaN 12
Drop unwanted columns, and you have the result
In [79]: final.drop(['c_x', 'c_y'], axis=1)
Out[79]:
a b d c
0 1 4 13 7
1 2 5 14 8
2 3 6 15 9
3 4 7 NaN 12
Here’s a way to do it with join
:
In [632]: t = left.set_index('a').join(right.set_index('a'), rsuffix='_right')
In [633]: t
Out[633]:
b c c_right d
a
1 4 9 7 13
2 5 10 8 14
3 6 11 9 15
4 7 12 NaN NaN
Now, we want to set null values of c_right
(which is from the right
dataframe) with values from c
column from the left
dataframe. Updated the below process with a method taking from @John Galt’s answer
In [657]: t['c_right'] = t['c_right'].fillna(t['c'])
In [658]: t
Out[658]:
b c c_right d
a
1 4 9 7 13
2 5 10 8 14
3 6 11 9 15
4 7 12 12 NaN
In [659]: t.drop('c_right', axis=1)
Out[659]:
b c d
a
1 4 9 13
2 5 10 14
3 6 11 15
4 7 12 NaN
One way to do this is to set the a column as the index and update
:
In [11]: left_a = left.set_index('a')
In [12]: right_a = right.set_index('a')
Note: update
only does a left join (not merges), so as well as set_index you also need to include the additional columns not present in left_a
.
In [13]: res = left_a.reindex(columns=left_a.columns.union(right_a.columns))
In [14]: res.update(right_a)
In [15]: res.reset_index(inplace=True)
In [16]: res
Out[16]:
a b c d
0 1 4 7 13
1 2 5 8 14
2 3 6 9 15
3 4 7 12 NaN
One other way is to use pd.merge like so:
>>> import pandas as pd
>>> final = pd.merge(left=right, right=left,
how='outer',
left_index=True,
right_index=True,
on=('a', 'c')
).sort_index(axis=1)
>>> final
a b c d
0 1 4 7 13.0
1 2 5 8 14.0
2 3 6 9 15.0
3 4 7 12 NaN
Provide the intersection of both dataframe’s columns to the ‘on=’ parameter of the function.
This does not create unwanted columns that have to be dropped like with Zero’s solution.
The NaN value might change integers to floats in the same column.
Edit: This works for Pandas versions <= 1.1.5
DataFrame.update() is nice, but it doesn’t let you specify columns to join on and more importantly, if the other dataframe has NaN values, those NaN values will not overwrite non-nan values in the original DataFrame. To me, this is undesirable behavior.
Here’s a custom method I rolled to fix these issues. It’s freshly written, so users beware..
join_insertion()
def join_insertion(into_df, from_df, cols, on, by=None, direction=None, mult='error'):
"""
Suppose A and B are dataframes. A has columns {foo, bar, baz} and B has columns {foo, baz, buz}
This function allows you to do an operation like:
"where A and B match via the column foo, insert the values of baz and buz from B into A"
Note that this'll update A's values for baz and it'll insert buz as a new column.
This is a lot like DataFrame.update(), but that method annoyingly ignores NaN values in B!
Optionally, direction can be given as 'backward', 'forward', or nearest to implement a rolling join
insertion. forward means 'roll into_df values forward to match from_df values', etc. Additionally,
when doing a rolling join, 'on' should be the roll column and 'by' should be the exact-match columns.
See pandas.merge_asof() for details.
Note that 'mult' gets ignored when doing a rolling join. In the case of a rolling join, the first
appearing record is kept, even if two records match a key from the same distance. Perhaps this
can be improved...
:param into_df: dataframe you want to modify
:param from_df: dataframe with the values you want to insert
:param cols: list of column names (values to insert)
:param on: list of column names (values to join on), or a dict of {into:from} column name pairs
:param by: same format as on; when doing a rolling join insertion, what columns to exact-match on
:param direction: 'forward', 'backward', or 'nearest'. forward means roll into_df values to match from_df
:param mult: if a key of into_df matches multiple rows of from_df, how should this be handled?
an error can be raised, or the first matching value can be inserted, or the last matching value
can be inserted
:return: a modified copy of into_df, with updated values using from_df
"""
# Infer left_on, right_on
if (isinstance(on, dict)):
left_on = list(on.keys())
right_on = list(on.values())
elif(isinstance(on, list)):
left_on = on
right_on = on
elif(isinstance(on, str)):
left_on = [on]
right_on = [on]
else:
raise Exception("on should be a list or dictionary")
# Infer left_by, right_by
if(by is not None):
if (isinstance(by, dict)):
left_by = list(by.keys())
right_by = list(by.values())
elif (isinstance(by, list)):
left_by = by
right_by = by
elif (isinstance(by, str)):
left_by = [by]
right_by = [by]
else:
raise Exception("by should be a list or dictionary")
else:
left_by = None
right_by = None
# Make cols a list if it isn't already
if(isinstance(cols, str)):
cols = [cols]
# Setup
A = into_df.copy()
B = from_df[right_on + cols + ([] if right_by is None else right_by)].copy()
# Insert row ids
A['_A_RowId_'] = np.arange(A.shape[0])
B['_B_RowId_'] = np.arange(B.shape[0])
# Merge
if(direction is None):
A = pd.merge(
left=A,
right=B,
how='left',
left_on=left_on,
right_on=right_on,
suffixes=(None, '_y'),
indicator=True
).sort_values(['_A_RowId_', '_B_RowId_'])
# Check for rows of A which got duplicated by the merge, and then handle appropriately
if (mult == 'error'):
if (A.groupby('_A_RowId_').size().max() > 1):
raise Exception("At least one key of into_df matched multiple rows of from_df.")
elif (mult == 'first'):
A = A.groupby('_A_RowId_').first().reset_index()
elif (mult == 'last'):
A = A.groupby('_A_RowId_').last().reset_index()
else:
A.sort_values(left_on, inplace=True)
B.sort_values(right_on, inplace=True)
A = pd.merge_asof(
left=A,
right=B,
direction=direction,
left_on=left_on,
right_on=right_on,
left_by=left_by,
right_by=right_by,
suffixes=(None, '_y')
).sort_values(['_A_RowId_', '_B_RowId_'])
# Insert values from new column(s) into pre-existing column(s)
mask = A._merge == 'both' if direction is None else np.repeat(True, A.shape[0])
cols_in_both = list(set(into_df.columns.to_list()).intersection(set(cols)))
for col in cols_in_both:
A.loc[mask, col] = A.loc[mask, col + '_y']
# Drop unwanted columns
A.drop(columns=list(set(A.columns).difference(set(into_df.columns.to_list() + cols))), inplace=True)
return A
Example Use
into_df = pd.DataFrame({
'foo': [1, 2, 3],
'bar': [4, 5, 6],
'baz': [7, 8, 9]
})
foo bar baz
0 1 4 7
1 2 5 8
2 3 6 9
from_df = pd.DataFrame({
'foo': [1, 3, 5, 7, 3],
'baz': [70, 80, 90, 30, 40],
'buz': [0, 1, 2, 3, 4]
})
foo baz buz
0 1 70 0
1 3 80 1
2 5 90 2
3 7 30 3
4 3 40 4
# Use it!
join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='error')
Exception: At least one key of into_df matched multiple rows of from_df.
join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='first')
foo bar baz buz
0 1 4 70.0 0.0
1 2 5 8.0 NaN
2 3 6 80.0 1.0
join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='last')
foo bar baz buz
0 1 4 70.0 0.0
1 2 5 8.0 NaN
2 3 6 40.0 4.0
As an aside, this is one of those things I severely miss from R’s data.table package. With data.table, this is as easy as x[y, Foo := i.Foo, on = c("a", "b")]
Here is another way that should work using combine_first()
right.set_index('a').combine_first(left.set_index('a')).reset_index()
I am new to pandas and can’t seem to get this to work with merge function:
>>> left >>> right
a b c a c d
0 1 4 9 0 1 7 13
1 2 5 10 1 2 8 14
2 3 6 11 2 3 9 15
3 4 7 12
With a left join on column a, I would like to update common columns BY THE JOINED KEYS. Note last value in column c is from LEFT table since there is no match.
>>> final
a b c d
0 1 4 7 13
1 2 5 8 14
2 3 6 9 15
3 4 7 12 NAN
How should I do this with Pandas merge function? Thank you.
You can use merge()
between left
and right
with how='left'
on 'a'
column.
In [74]: final = left.merge(right, on='a', how='left')
In [75]: final
Out[75]:
a b c_x c_y d
0 1 4 9 7 13
1 2 5 10 8 14
2 3 6 11 9 15
3 4 7 12 NaN NaN
Replace NaN
value from c_y
with c_x
value
In [76]: final['c'] = final['c_y'].fillna(final['c_x'])
In [77]: final
Out[77]:
a b c_x c_y d c
0 1 4 9 7 13 7
1 2 5 10 8 14 8
2 3 6 11 9 15 9
3 4 7 12 NaN NaN 12
Drop unwanted columns, and you have the result
In [79]: final.drop(['c_x', 'c_y'], axis=1)
Out[79]:
a b d c
0 1 4 13 7
1 2 5 14 8
2 3 6 15 9
3 4 7 NaN 12
Here’s a way to do it with join
:
In [632]: t = left.set_index('a').join(right.set_index('a'), rsuffix='_right')
In [633]: t
Out[633]:
b c c_right d
a
1 4 9 7 13
2 5 10 8 14
3 6 11 9 15
4 7 12 NaN NaN
Now, we want to set null values of c_right
(which is from the right
dataframe) with values from c
column from the left
dataframe. Updated the below process with a method taking from @John Galt’s answer
In [657]: t['c_right'] = t['c_right'].fillna(t['c'])
In [658]: t
Out[658]:
b c c_right d
a
1 4 9 7 13
2 5 10 8 14
3 6 11 9 15
4 7 12 12 NaN
In [659]: t.drop('c_right', axis=1)
Out[659]:
b c d
a
1 4 9 13
2 5 10 14
3 6 11 15
4 7 12 NaN
One way to do this is to set the a column as the index and update
:
In [11]: left_a = left.set_index('a')
In [12]: right_a = right.set_index('a')
Note: update
only does a left join (not merges), so as well as set_index you also need to include the additional columns not present in left_a
.
In [13]: res = left_a.reindex(columns=left_a.columns.union(right_a.columns))
In [14]: res.update(right_a)
In [15]: res.reset_index(inplace=True)
In [16]: res
Out[16]:
a b c d
0 1 4 7 13
1 2 5 8 14
2 3 6 9 15
3 4 7 12 NaN
One other way is to use pd.merge like so:
>>> import pandas as pd
>>> final = pd.merge(left=right, right=left,
how='outer',
left_index=True,
right_index=True,
on=('a', 'c')
).sort_index(axis=1)
>>> final
a b c d
0 1 4 7 13.0
1 2 5 8 14.0
2 3 6 9 15.0
3 4 7 12 NaN
Provide the intersection of both dataframe’s columns to the ‘on=’ parameter of the function.
This does not create unwanted columns that have to be dropped like with Zero’s solution.
The NaN value might change integers to floats in the same column.
Edit: This works for Pandas versions <= 1.1.5
DataFrame.update() is nice, but it doesn’t let you specify columns to join on and more importantly, if the other dataframe has NaN values, those NaN values will not overwrite non-nan values in the original DataFrame. To me, this is undesirable behavior.
Here’s a custom method I rolled to fix these issues. It’s freshly written, so users beware..
join_insertion()
def join_insertion(into_df, from_df, cols, on, by=None, direction=None, mult='error'):
"""
Suppose A and B are dataframes. A has columns {foo, bar, baz} and B has columns {foo, baz, buz}
This function allows you to do an operation like:
"where A and B match via the column foo, insert the values of baz and buz from B into A"
Note that this'll update A's values for baz and it'll insert buz as a new column.
This is a lot like DataFrame.update(), but that method annoyingly ignores NaN values in B!
Optionally, direction can be given as 'backward', 'forward', or nearest to implement a rolling join
insertion. forward means 'roll into_df values forward to match from_df values', etc. Additionally,
when doing a rolling join, 'on' should be the roll column and 'by' should be the exact-match columns.
See pandas.merge_asof() for details.
Note that 'mult' gets ignored when doing a rolling join. In the case of a rolling join, the first
appearing record is kept, even if two records match a key from the same distance. Perhaps this
can be improved...
:param into_df: dataframe you want to modify
:param from_df: dataframe with the values you want to insert
:param cols: list of column names (values to insert)
:param on: list of column names (values to join on), or a dict of {into:from} column name pairs
:param by: same format as on; when doing a rolling join insertion, what columns to exact-match on
:param direction: 'forward', 'backward', or 'nearest'. forward means roll into_df values to match from_df
:param mult: if a key of into_df matches multiple rows of from_df, how should this be handled?
an error can be raised, or the first matching value can be inserted, or the last matching value
can be inserted
:return: a modified copy of into_df, with updated values using from_df
"""
# Infer left_on, right_on
if (isinstance(on, dict)):
left_on = list(on.keys())
right_on = list(on.values())
elif(isinstance(on, list)):
left_on = on
right_on = on
elif(isinstance(on, str)):
left_on = [on]
right_on = [on]
else:
raise Exception("on should be a list or dictionary")
# Infer left_by, right_by
if(by is not None):
if (isinstance(by, dict)):
left_by = list(by.keys())
right_by = list(by.values())
elif (isinstance(by, list)):
left_by = by
right_by = by
elif (isinstance(by, str)):
left_by = [by]
right_by = [by]
else:
raise Exception("by should be a list or dictionary")
else:
left_by = None
right_by = None
# Make cols a list if it isn't already
if(isinstance(cols, str)):
cols = [cols]
# Setup
A = into_df.copy()
B = from_df[right_on + cols + ([] if right_by is None else right_by)].copy()
# Insert row ids
A['_A_RowId_'] = np.arange(A.shape[0])
B['_B_RowId_'] = np.arange(B.shape[0])
# Merge
if(direction is None):
A = pd.merge(
left=A,
right=B,
how='left',
left_on=left_on,
right_on=right_on,
suffixes=(None, '_y'),
indicator=True
).sort_values(['_A_RowId_', '_B_RowId_'])
# Check for rows of A which got duplicated by the merge, and then handle appropriately
if (mult == 'error'):
if (A.groupby('_A_RowId_').size().max() > 1):
raise Exception("At least one key of into_df matched multiple rows of from_df.")
elif (mult == 'first'):
A = A.groupby('_A_RowId_').first().reset_index()
elif (mult == 'last'):
A = A.groupby('_A_RowId_').last().reset_index()
else:
A.sort_values(left_on, inplace=True)
B.sort_values(right_on, inplace=True)
A = pd.merge_asof(
left=A,
right=B,
direction=direction,
left_on=left_on,
right_on=right_on,
left_by=left_by,
right_by=right_by,
suffixes=(None, '_y')
).sort_values(['_A_RowId_', '_B_RowId_'])
# Insert values from new column(s) into pre-existing column(s)
mask = A._merge == 'both' if direction is None else np.repeat(True, A.shape[0])
cols_in_both = list(set(into_df.columns.to_list()).intersection(set(cols)))
for col in cols_in_both:
A.loc[mask, col] = A.loc[mask, col + '_y']
# Drop unwanted columns
A.drop(columns=list(set(A.columns).difference(set(into_df.columns.to_list() + cols))), inplace=True)
return A
Example Use
into_df = pd.DataFrame({
'foo': [1, 2, 3],
'bar': [4, 5, 6],
'baz': [7, 8, 9]
})
foo bar baz
0 1 4 7
1 2 5 8
2 3 6 9
from_df = pd.DataFrame({
'foo': [1, 3, 5, 7, 3],
'baz': [70, 80, 90, 30, 40],
'buz': [0, 1, 2, 3, 4]
})
foo baz buz
0 1 70 0
1 3 80 1
2 5 90 2
3 7 30 3
4 3 40 4
# Use it!
join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='error')
Exception: At least one key of into_df matched multiple rows of from_df.
join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='first')
foo bar baz buz
0 1 4 70.0 0.0
1 2 5 8.0 NaN
2 3 6 80.0 1.0
join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='last')
foo bar baz buz
0 1 4 70.0 0.0
1 2 5 8.0 NaN
2 3 6 40.0 4.0
As an aside, this is one of those things I severely miss from R’s data.table package. With data.table, this is as easy as x[y, Foo := i.Foo, on = c("a", "b")]
Here is another way that should work using combine_first()
right.set_index('a').combine_first(left.set_index('a')).reset_index()