Compare two DataFrame columns of lists of strings (A & B) to find if any in B are NOT in A for adding to Col C


import pandas as pd

d = {'Col A': [['Singapore','Germany','UK'],['Ireland','Japan','Australia'],['India','Korea','Vietnam']], 'Col B': [['Singapore','Germany','UK'],['Ireland','Japan'],['India','Mexico','Argentina']]}

df = pd.DataFrame(data=d)

I’m trying to compare these two columns and return a new column, Col C, that contains any strings that are present in Col B but NOT present in Col A. Row 1, where the two lists are identical, returns no value. Row 2, where Col A contains an extra country (‘Australia’) but Col B contains nothing new, also returns no value. Row 3 returns ‘Mexico’ and ‘Argentina’, but not ‘Korea’ or ‘Vietnam’.

I’ve tried creating a separate column from Col A that drops the countries not present in Col B (such as Australia), because it’s okay for countries to be in Col A that are NOT in Col B. I then used a list comprehension to identify the strings that differ between the two lists so they can be added to Col C. But I feel like there must be a simpler method.

Asked By: Michael Kessler



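You can use np.setdiff1d, which returns the values in its first array that are not in its second.

import numpy as np

# For each row, keep the values of Col B that do not appear in Col A
df['Col C'] = [np.setdiff1d(row['Col B'], row['Col A']).tolist() for _, row in df.iterrows()]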
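One thing to be aware of: np.setdiff1d returns sorted, de-duplicated values, so the original order of Col B is not kept (row 3 would come back as 'Argentina', 'Mexico'). If the order matters, a plain set lookup may be enough. The following is only a sketch under that assumption; the helper name missing_from_a is hypothetical and not part of pandas or NumPy.

```python
import pandas as pd

d = {
    'Col A': [['Singapore', 'Germany', 'UK'],
              ['Ireland', 'Japan', 'Australia'],
              ['India', 'Korea', 'Vietnam']],
    'Col B': [['Singapore', 'Germany', 'UK'],
              ['Ireland', 'Japan'],
              ['India', 'Mexico', 'Argentina']],
}
df = pd.DataFrame(data=d)


def missing_from_a(a, b):
    """Hypothetical helper: items of b that are not in a, in b's original order."""
    seen = set(a)  # set membership tests are O(1) on average
    return [item for item in b if item not in seen]


# Walk both columns row by row and build Col C as a list of lists
df['Col C'] = [missing_from_a(a, b) for a, b in zip(df['Col A'], df['Col B'])]

print(df['Col C'].tolist())
# [[], [], ['Mexico', 'Argentina']]
```

Rows with nothing new in Col B get an empty list here. If you would rather have a missing value, you could replace empty lists afterwards.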
Answered By: Michael Cao