Anti-Join Pandas

Question

I have two tables and I would like to append them so that only all the data in table A is retained and data from table B is only added if its key is unique (Key values are unique in table A and B however in some cases a Key will occur in both table A and B).

I think the way to do this will involve some sort of filtering join (anti-join) to get values in table B that do not occur in table A then append the two tables.

I am familiar with R and this is the code I would use to do this in R.

library("dplyr")

## Filtering join to remove values already in "TableA" from "TableB"
FilteredTableB <- anti_join(TableB,TableA, by = "Key")

## Append "FilteredTableB" to "TableA"
CombinedTable <- bind_rows(TableA,FilteredTableB)

How would I achieve this in python?

Asked By: Ayelavan

||

Source

Answer 1

Consider the following dataframes

TableA = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('abcd'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('aecf'), name='Key'),
                      ['A', 'B', 'C']).reset_index()

TableA

TableB

This is one way to do what you want

Method 1

# Identify what values are in TableB and not in TableA
key_diff = set(TableB.Key).difference(TableA.Key)
where_diff = TableB.Key.isin(key_diff)

# Slice TableB accordingly and append to TableA
TableA.append(TableB[where_diff], ignore_index=True)

Method 2

rows = []
for i, row in TableB.iterrows():
    if row.Key not in TableA.Key.values:
        rows.append(row)

pd.concat([TableA.T] + rows, axis=1).T

Timing

4 rows with 2 overlap

Method 1 is much quicker

10,000 rows 5,000 overlap

loops are bad

Answered By: piRSquared

Answer 2

You’ll have both tables TableA and TableB such that both DataFrame objects have columns with unique values in their respective tables, but some columns may have values that occur simultaneously (have the same values for a row) in both tables.

Then, we want to merge the rows in TableA with the rows in TableB that don’t match any in TableA for a ‘Key’ column. The concept is to picture it as comparing two series of variable length, and combining the rows in one series sA with the other sB if sB‘s values don’t match sA‘s. The following code solves this exercise:

import pandas as pd

TableA = pd.DataFrame([[2, 3, 4], [5, 6, 7], [8, 9, 10]])
TableB = pd.DataFrame([[1, 3, 4], [5, 7, 8], [9, 10, 0]])

removeTheseIndexes = []
keyColumnA = TableA.iloc[:,1] # your 'Key' column here
keyColumnB = TableB.iloc[:,1] # same

for i in range(0, len(keyColumnA)):
    firstValue = keyColumnA[i]
    for j in range(0, len(keyColumnB)):
        copycat = keyColumnB[j]
        if firstValue == copycat:
            removeTheseIndexes.append(j)

TableB.drop(removeTheseIndexes, inplace = True)
TableA = TableA.append(TableB)
TableA = TableA.reset_index(drop=True)

Note this affects TableB‘s data as well. You can use inplace=False and re-assign it to a newTable, then TableA.append(newTable) alternatively.

# Table A
   0  1   2
0  2  3   4
1  5  6   7
2  8  9  10

# Table B
   0   1  2
0  1   3  4
1  5   7  8
2  9  10  0

# Set 'Key' column = 1
# Run the script after the loop

# Table A
   0   1   2
0  2   3   4
1  5   6   7
2  8   9  10
3  5   7   8
4  9  10   0

# Table B
   0   1  2
1  5   7  8
2  9  10  0

Answered By: Jossie Calderon

Answer 3

I had the same problem. This answer using how='outer' and indicator=True of merge inspired me to come up with this solution:

import pandas as pd
import numpy as np

TableA = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('abcd'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('aecf'), name='Key'),
                      ['A', 'B', 'C']).reset_index()

print('TableA', TableA, sep='n')
print('TableB', TableB, sep='n')

TableB_only = pd.merge(
    TableA, TableB,
    how='outer', on='Key', indicator=True, suffixes=('_foo','')).query(
        '_merge == "right_only"')

print('TableB_only', TableB_only, sep='n')

Table_concatenated = pd.concat((TableA, TableB_only), join='inner')

print('Table_concatenated', Table_concatenated, sep='n')

Which prints this output:

TableA
  Key         A         B         C
0   a  0.035548  0.344711  0.860918
1   b  0.640194  0.212250  0.277359
2   c  0.592234  0.113492  0.037444
3   d  0.112271  0.205245  0.227157
TableB
  Key         A         B         C
0   a  0.754538  0.692902  0.537704
1   e  0.499092  0.864145  0.004559
2   c  0.082087  0.682573  0.421654
3   f  0.768914  0.281617  0.924693
TableB_only
  Key  A_foo  B_foo  C_foo         A         B         C      _merge
4   e    NaN    NaN    NaN  0.499092  0.864145  0.004559  right_only
5   f    NaN    NaN    NaN  0.768914  0.281617  0.924693  right_only
Table_concatenated
  Key         A         B         C
0   a  0.035548  0.344711  0.860918
1   b  0.640194  0.212250  0.277359
2   c  0.592234  0.113492  0.037444
3   d  0.112271  0.205245  0.227157
4   e  0.499092  0.864145  0.004559
5   f  0.768914  0.281617  0.924693

Answered By: tommy.carstensen

Answer 4

Easiest answer imaginable:

tableB = pd.concat([tableB, pd.Series(1)], axis=1)
mergedTable = tableA.merge(tableB, how="left" on="key")

answer = mergedTable[mergedTable.iloc[:,-1].isnull()][tableA.columns.tolist()]

Should be the fastest proposed as well.

Answered By: Jamie Marshall

Answer 5

Based on one of the other suggestions, here’s a function that should do it. Using only pandas functions, no looping. You can use multiple columns as the key as well. If you change the line output = merged.loc[merged.dummy_col.isna(),tableA.columns.tolist()] to output = merged.loc[~merged.dummy_col.isna(),tableA.columns.tolist()] you have a semi_join.

def anti_join(tableA,tableB,on):

   #if joining on index, make it into a column
   if tableB.index.name is not None:
       dummy = tableB.reset_index()[on]
   else:
       dummy = tableB[on]

   #create a dummy columns of 1s
   if isinstance(dummy, pd.Series):
       dummy = dummy.to_frame()

   dummy.loc[:,'dummy_col'] = 1

   #preserve the index of tableA if it has one
   if tableA.index.name is not None:
       idx_name = tableA.index.name
       tableA = tableA.reset_index(drop = False)
   else:
       idx_name = None

   #do a left-join
   merged = tableA.merge(dummy,on=on,how='left')

   #keep only the non-matches
   output = merged.loc[merged.dummy_col.isna(),tableA.columns.tolist()]

   #reset the index (if applicable)
   if idx_name is not None:
       output = output.set_index(idx_name)

   return(output)

Answered By: thrillhouse

Answer 6

indicator = True in merge command will tell you which join was applied by creating new column _merge with three possible values:

left_only
right_only
both

Keep right_only and left_only. That is it.

outer_join = TableA.merge(TableB, how = 'outer', indicator = True)

anti_join = outer_join[~(outer_join._merge == 'both')].drop('_merge', axis = 1)

easy!

Here is a comparison with a solution from piRSquared:

1) When run on this example matching based on one column, piRSquared’s solution is faster.

2) But it only works for matching on one column. If you want to match on several columns – my solution works just as fine as with one column.

So it’s up for you to decide.

Answered By: Dennis Lyubyvy

Answer 7

One liner

TableA.append(TableB.loc[~TableB.Key.isin(TableA.Key)], ignore_index=True)

%%timeit gives about the same timing as the accepted answer.

Answered By: asiera

Anti-Join Pandas

Question:

Answers:

Method 1

Method 2

Timing