Replacing a value from df1['colA'] with df2['ColB'] using a unique identifier?

Question

Hi, I am trying to replace values in df1 column a with values from df2 column b, by matching them against df2 column a. Basically, if the string in row x of df1['a'] equals the string in row y of df2['a'], I want to replace the value of df1['a'] with df2['b']. I have tried a couple of things, but for some reason this isn't working properly. I also want to replace values that aren't in df2['a'] with None.

My sample data is:

import pandas as pd

df1 = pd.DataFrame({'a': ['a', 'b', 'a', 'd', 'e', 'f', 'g', 'h', 'i'],
                    'b': ['alpha', 'alpha', 'alpha', 'beta', 'beta', 'charlie', 'charlie', 'alpha', 'beta'],
                    'c': ['elephant', 'zebra', 'elephant', 'zebra', 'elephant', 'zebra', 'elephant', 'elephant', 'zebra']})

df2 = pd.DataFrame({'a': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
                    'b': ['alpha', 'alpha', 'alpha', 'beta', 'beta', 'charlie', 'charlie'],
                    'c': ['elephant', 'zebra', 'elephant', 'zebra', 'elephant', 'zebra', 'elephant']})

# Build a row identifier by concatenating the three columns
df1['UID'] = df1['a'] + df1['b'] + df1['c']
df2['UID'] = df2['a'] + df2['b'] + df2['c']

# Overwrite 'a' where the UID also appears in df2
df1.loc[df1['UID'].isin(df2['UID']), 'a'] = df2['c']
# Anything in 'a' that is not an animal name becomes the string "None"
animals = ['elephant', 'zebra']
df1.loc[~df1['a'].isin(animals), 'a'] = 'None'

This works on my sample data but not on my actual data set, which is much larger. Any ideas on how to do something similar to this?

Asked By: Ahmed Tawakol

Answer 1

I think the description in the question does not quite match what your code does. Based on your code attempt, I suspect that what you mean is:

For each row i of df1 that matches a row j of df2 on all fields (a, b, c), replace df1.loc[i, 'a'] with df2.loc[j, 'c']; if no row of df2 matches, set df1.loc[i, 'a'] to 'None'.

If that is the correct interpretation of your question, then:

First, it is safer to use a tuple of the row values as the UID for the row, instead of string concatenation: imagine a row '_', 'foo', 'bar' and another '_', 'fooba', 'r'. They are certainly distinct, yet both concatenate to the same string, '_foobar'. The second advantage of a tuple is that it works with other types, not just strings. Thus:

df1['UID'] = df1[['a', 'b', 'c']].apply(tuple, axis=1)
df2['UID'] = df2[['a', 'b', 'c']].apply(tuple, axis=1)
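
To make the collision concrete, here is a small sketch using the two example rows mentioned above. The frame name rows and the variables concat_uid and tuple_uid are only for illustration and are not part of the original data:

```python
import pandas as pd

# Two distinct rows whose string concatenations happen to coincide
rows = pd.DataFrame({'a': ['_', '_'],
                     'b': ['foo', 'fooba'],
                     'c': ['bar', 'r']})

# Concatenated UID: both rows become '_foobar'
concat_uid = rows['a'] + rows['b'] + rows['c']

# Tuple UID: the field boundaries are preserved, so the rows stay distinct
tuple_uid = rows[['a', 'b', 'c']].apply(tuple, axis=1)

print(concat_uid.nunique())  # 1
print(tuple_uid.nunique())   # 2
```

With concatenation, a large real data set only needs one such coincidence for isin or a merge to match the wrong rows.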

Then, the expected result can be obtained by merging on UID:

df = df1.assign(
    a=df1.merge(
        df2[['UID', 'c']], on='UID', how='left',
        suffixes=['', '_y'])['c_y'].fillna('None')
)

>>> df
          a        b         c                     UID
0  elephant    alpha  elephant    (a, alpha, elephant)
1     zebra    alpha     zebra       (b, alpha, zebra)
2  elephant    alpha  elephant    (a, alpha, elephant)
3     zebra     beta     zebra        (d, beta, zebra)
4  elephant     beta  elephant     (e, beta, elephant)
5     zebra  charlie     zebra     (f, charlie, zebra)
6  elephant  charlie  elephant  (g, charlie, elephant)
7      None    alpha  elephant    (h, alpha, elephant)
8      None     beta     zebra        (i, beta, zebra)
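
One caveat, offered as a sketch rather than a correction: merge returns a frame with a fresh 0..n-1 index, and assign aligns on index labels, so the approach above lines up only when df1 has its default index and df2 has no duplicate UIDs. If either assumption might not hold in the real data, a plain dictionary lookup is one possible alternative, since it matches by key and assigns by position. The names cols, lookup and new_a are illustrative and not from the original post:

```python
import pandas as pd

df1 = pd.DataFrame({
    'a': ['a', 'b', 'a', 'd', 'e', 'f', 'g', 'h', 'i'],
    'b': ['alpha', 'alpha', 'alpha', 'beta', 'beta', 'charlie', 'charlie', 'alpha', 'beta'],
    'c': ['elephant', 'zebra', 'elephant', 'zebra', 'elephant', 'zebra', 'elephant', 'elephant', 'zebra'],
})
df2 = pd.DataFrame({
    'a': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    'b': ['alpha', 'alpha', 'alpha', 'beta', 'beta', 'charlie', 'charlie'],
    'c': ['elephant', 'zebra', 'elephant', 'zebra', 'elephant', 'zebra', 'elephant'],
})

cols = ['a', 'b', 'c']

# Map each (a, b, c) tuple of df2 to the value that should replace 'a'
lookup = dict(zip(df2[cols].itertuples(index=False, name=None), df2['c']))

# Look up every row of df1 by its tuple; unmatched rows get the string 'None'
new_a = [lookup.get(key, 'None')
         for key in df1[cols].itertuples(index=False, name=None)]

# A list is assigned positionally, so df1's index labels do not matter
df = df1.assign(a=new_a)
print(df)
```

On the sample data this gives the same 'a' column as the merge; it also skips the helper UID column entirely.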

Answered By: Pierre D
