replacing a value from df1['colA'] with df2['ColB'] using a unique identifier?
Question:
Hi I am trying to replace values in a df1 column A with values from df2 column B, by matching them with df2 column A. Basically if the string of row x in df1[‘a’] is equal to a string of row y in df2[‘a’] I want to replace the value of df1[‘a’] with df2[‘b’]. I have tried a couple things but for some reason this isn’t working properly. I also wants to replace values that aren’t in df2[‘a’] with None.
my sample data is:
df1 = pd.DataFrame({'a': ['a','b','a','d','e','f','g', 'h', 'i'],
'b': ['alpha', 'alpha', 'alpha', 'beta', 'beta', 'charlie', 'charlie', "alpha", "beta"],
'c': ['elephant', "zebra",'elephant', "zebra",'elephant', "zebra",'elephant','elephant', "zebra"]})
df2 = pd.DataFrame({'a': ['a','b','c','d','e','f','g'],
'b': ['alpha', 'alpha', 'alpha', 'beta', 'beta', 'charlie', 'charlie'],
'c': ['elephant', "zebra",'elephant', "zebra",'elephant', "zebra",'elephant']})
df1['UID'] = df1['a']+ df1['b']+df1['c']
df2['UID'] = df2['a']+ df2['b']+df2['c']
df1['a'].loc[df1['UID'].isin(df2['UID'])] = df2['c']
animals = ['elephant','zebra']
df1.loc[~df1['a'].isin(animals), "a"] = "None"
This works in my sample data but isn’t working in my actual data set which is much larger. Any ideas on how to do something similar to this?
Answers:
I think the explanation is not quite correct. Based on your code attempt, I suspect that what you mean is:
For each row i
of df1
that matches (for all fields (a, b, c)
) a row j
of df2
, then replace df1.loc[i, 'a']
by df2.loc[j, 'c']
.
If that is the correct interpretation of your question, then:
First, it is safer to use a tuple
of the row values as UID
for the row, instead of the string concatenation: imagine a row '_', 'foo', 'bar'
and another '_', 'fooba', 'r'
— they are most certainly distinct. The second advantage of tuple
is that it works with other types, not just strings. Thus:
df1['UID'] = df1[['a', 'b', 'c']].apply(tuple, axis=1)
df2['UID'] = df2[['a', 'b', 'c']].apply(tuple, axis=1)
Then, the expected result can be obtained by merging on UID
:
df = df1.assign(
a=df1.merge(
df2[['UID', 'c']], on='UID', how='left',
suffixes=['', '_y'])['c_y'].fillna('None')
)
>>> df
a b c UID
0 elephant alpha elephant (a, alpha, elephant)
1 zebra alpha zebra (b, alpha, zebra)
2 elephant alpha elephant (a, alpha, elephant)
3 zebra beta zebra (d, beta, zebra)
4 elephant beta elephant (e, beta, elephant)
5 zebra charlie zebra (f, charlie, zebra)
6 elephant charlie elephant (g, charlie, elephant)
7 None alpha elephant (h, alpha, elephant)
8 None beta zebra (i, beta, zebra)
Hi I am trying to replace values in a df1 column A with values from df2 column B, by matching them with df2 column A. Basically if the string of row x in df1[‘a’] is equal to a string of row y in df2[‘a’] I want to replace the value of df1[‘a’] with df2[‘b’]. I have tried a couple things but for some reason this isn’t working properly. I also wants to replace values that aren’t in df2[‘a’] with None.
my sample data is:
df1 = pd.DataFrame({'a': ['a','b','a','d','e','f','g', 'h', 'i'],
'b': ['alpha', 'alpha', 'alpha', 'beta', 'beta', 'charlie', 'charlie', "alpha", "beta"],
'c': ['elephant', "zebra",'elephant', "zebra",'elephant', "zebra",'elephant','elephant', "zebra"]})
df2 = pd.DataFrame({'a': ['a','b','c','d','e','f','g'],
'b': ['alpha', 'alpha', 'alpha', 'beta', 'beta', 'charlie', 'charlie'],
'c': ['elephant', "zebra",'elephant', "zebra",'elephant', "zebra",'elephant']})
df1['UID'] = df1['a']+ df1['b']+df1['c']
df2['UID'] = df2['a']+ df2['b']+df2['c']
df1['a'].loc[df1['UID'].isin(df2['UID'])] = df2['c']
animals = ['elephant','zebra']
df1.loc[~df1['a'].isin(animals), "a"] = "None"
This works in my sample data but isn’t working in my actual data set which is much larger. Any ideas on how to do something similar to this?
I think the explanation is not quite correct. Based on your code attempt, I suspect that what you mean is:
For each row
i
ofdf1
that matches (for all fields(a, b, c)
) a rowj
ofdf2
, then replacedf1.loc[i, 'a']
bydf2.loc[j, 'c']
.
If that is the correct interpretation of your question, then:
First, it is safer to use a tuple
of the row values as UID
for the row, instead of the string concatenation: imagine a row '_', 'foo', 'bar'
and another '_', 'fooba', 'r'
— they are most certainly distinct. The second advantage of tuple
is that it works with other types, not just strings. Thus:
df1['UID'] = df1[['a', 'b', 'c']].apply(tuple, axis=1)
df2['UID'] = df2[['a', 'b', 'c']].apply(tuple, axis=1)
Then, the expected result can be obtained by merging on UID
:
df = df1.assign(
a=df1.merge(
df2[['UID', 'c']], on='UID', how='left',
suffixes=['', '_y'])['c_y'].fillna('None')
)
>>> df
a b c UID
0 elephant alpha elephant (a, alpha, elephant)
1 zebra alpha zebra (b, alpha, zebra)
2 elephant alpha elephant (a, alpha, elephant)
3 zebra beta zebra (d, beta, zebra)
4 elephant beta elephant (e, beta, elephant)
5 zebra charlie zebra (f, charlie, zebra)
6 elephant charlie elephant (g, charlie, elephant)
7 None alpha elephant (h, alpha, elephant)
8 None beta zebra (i, beta, zebra)