Find the difference between two columns in a dataframe but keeping the row index avaiable

Question:

I have two dataframes:

df1 = pd.DataFrame({"product":['apples', 'bananas', 'oranges', 'kiwi']})
df2 = pd.Dataframe({"product":['apples', 'aples', 'appples', 'banans', 'oranges', 'kiwki'], "key": [1, 2, 3, 4, 5, 6]})

I want to use something like a set(df2).difference(df1) to find the difference between the product columns but I want to keep the indexes. So ideally the output would look like this

result =[‘aples’, ‘appples’, ‘banans’, ‘kiwki’][2 3 4 6]

Whenever I use the set.difference() I get the list of the different values but I lose the key index.

Asked By: Dominique Barnes

||

Answers:

I guess you are trying to do a left anti join, which means you only want to keep the rows in df2 that aren’t present in df1. In that case:

df1 = pd.DataFrame({"product":['apples', 'bananas', 'oranges', 'kiwi']})
df2 = pd.DataFrame({"product":['apples', 'aples', 'appples', 'banans', 'oranges', 'kiwki'], "key":[1, 2, 3, 4, 5, 6]})

# left join
joined_df = df2.merge(df1, on='product', how='left', indicator=True)
# keeping products that were only present in df2
products_only_in_df2 = joined_df.loc[joined_df['_merge'] == 'left_only', 'product']
# filtering df2 using the above df so we have the keys as well
result = df2[df2['product'].isin(products_only_in_df2)]
Answered By: akapusuz

You have to filter the df2 frame checking if the elements from df2 are not in df1:

df2[~df2["product"].isin(df1['product'])]
  • ~ negates the values of a boolean Series.
  • ser1.isin(ser2) is a boolean Series which gives, for each element of ser 1, whether or not the value can be found in ser2.
Answered By: EEtch
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.