Pandas check if row exist in another dataframe and append index

Question:

I’m having one problem to iterate over my dataframe. The way I’m doing is taking a long time and I don’t have that many rows (I have like 300k rows)

What am I trying to do?

  1. Check if one DF (A) contains the value of two columns of the other DF (B). You can think of this as a multiple-key field

  2. If True, get the index of DF.B and assign to one column of DF.A

  3. If False, two steps:

a. append to DF.B the two columns not found

b. assign the new ID to DF.A (I couldn’t do this one)

This is my code, where:

  1. df is DF.A and df_id is DF.B:

  2. SampleID and ParentID are the two columns I am interested to check if they exist in both dataframes

  3. Real_ID is the column to which I want to assign the id of DF.B (df_id)

for index, row in df.iterrows():
    #check if columns exist in the other dataframe
    real_id = df_id[(df_id['SampleID'] == row['SampleID']) & (df_id['ParentID'] == row['ParentID'])]
    
    if real_id.empty:
        #row does not exist, append to df_id
        df_id = df_id.append(row[['SampleID','ParentID']])
    else:
        #row exists, assign id of df_id to df
        row['Real_ID'] = real_id.index

EXAMPLE:

DF.A (df)

   Real_ID   SampleID   ParentID  Something AnotherThing
0             20          21          a          b      
1             10          11          a          b      
2             40          51          a          b       

DF.B (df_id)

   SampleID   ParentID  
0    10          11         
1    20          21     

Result:

   Real_ID   SampleID   ParentID  Something AnotherThing
0      1      10          11          a          b      
1      0      20          21          a          b      
2      2      40          51          a          b      


   SampleID   ParentID  
0    20          21         
1    10          11    
2    40          51

Again, this solution is very slow. I’m sure there is a better way to do this and that’s why I’m asking here. Unfortunately this was what I got after some hours…

Thanks

Asked By: renno

||

Answers:

you can do it this way:

Data (pay attention at the index in the B DF):

In [276]: cols = ['SampleID', 'ParentID']

In [277]: A
Out[277]:
   Real_ID  SampleID  ParentID Something AnotherThing
0      NaN        10        11         a            b
1      NaN        20        21         a            b
2      NaN        40        51         a            b

In [278]: B
Out[278]:
   SampleID  ParentID
3        10        11
5        20        21

Solution:

In [279]: merged = pd.merge(A[cols], B, on=cols, how='outer', indicator=True)

In [280]: merged
Out[280]:
   SampleID  ParentID     _merge
0        10        11       both
1        20        21       both
2        40        51  left_only


In [281]: B = pd.concat([B, merged.ix[merged._merge=='left_only', cols]])

In [282]: B
Out[282]:
   SampleID  ParentID
3        10        11
5        20        21
2        40        51

In [285]: A['Real_ID'] = pd.merge(A[cols], B.reset_index(), on=cols)['index']

In [286]: A
Out[286]:
   Real_ID  SampleID  ParentID Something AnotherThing
0        3        10        11         a            b
1        5        20        21         a            b
2        2        40        51         a            b

I agree with the accepted answer but I would strongly advise adding the ignore_index=True parameter to pd.concat(). If you don’t do this the merge step may result in duplicate entries per row in dataframe B. I would reply to the answer directly but I don’t have enough reputation points yet (lol).

Answered By: oatmilk.latte
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.