Python - How to compare columns from two dataframes and create a 3rd with new values?

Question:

I have two dataframes that contain names. I need to check which of the names in the second dataframe are not present in the first dataframe.
For this example:

import pandas as pd

list1 = ['Mark','Sofi','Joh','Leo','Jason']
df1 = pd.DataFrame(list1, columns =['Names'])

and

list2 = ['Mark','Sofi','David','Matt','Jason']
df2 = pd.DataFrame(list2, columns =['Names'])

In this simple example, we can see that David and Matt from the second dataframe do not exist in the first dataframe.

I need to programmatically build a 3rd dataframe with results like this:

Names
David
Matt

My first thought was to use the pandas merge function, but I am unable to get the unique set of names from df2 that are not in df1.

Any thoughts on how to do this?

Asked By: Slavisha84

Answers:

You can create the 3rd dataframe by filtering the 2nd with a condition like this:

df3 = df2[~df2['Names'].isin(df1['Names'])]
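To make the filter easier to follow, here is a minimal self-contained sketch that splits it into two steps. The intermediate name in_df1 and the reset_index call are additions for illustration, not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({"Names": ["Mark", "Sofi", "Joh", "Leo", "Jason"]})
df2 = pd.DataFrame({"Names": ["Mark", "Sofi", "David", "Matt", "Jason"]})

# Boolean mask: True for every row of df2 whose name also appears in df1
in_df1 = df2["Names"].isin(df1["Names"])

# ~ inverts the mask, keeping only names missing from df1;
# reset_index(drop=True) renumbers the remaining rows from 0 (optional)
df3 = df2[~in_df1].reset_index(drop=True)
print(df3)
```

Without reset_index, the rows keep their original df2 index labels (2 and 3 here).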
Answered By: Pedro Rocha

You can also use merge with indicator:

>>> df1.merge(df2, on='Names', how='outer', indicator='exist')
   Names       exist
0   Mark        both
1   Sofi        both
2    Joh   left_only
3    Leo   left_only
4  Jason        both
5  David  right_only
6   Matt  right_only

>>> (df1.merge(df2, on='Names', how='outer', indicator='exist')
        .loc[lambda x: x.pop('exist') == 'right_only'])
   Names
5  David
6   Matt
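If only the names unique to df2 are needed, a right join is one possible variation: rows found only in df1 never enter the result. This is a sketch under the assumption that names in df1 are unique; it uses indicator=True, which names the indicator column _merge by default.

```python
import pandas as pd

df1 = pd.DataFrame({"Names": ["Mark", "Sofi", "Joh", "Leo", "Jason"]})
df2 = pd.DataFrame({"Names": ["Mark", "Sofi", "David", "Matt", "Jason"]})

# how="right" keeps every row of df2; indicator=True adds a "_merge" column
merged = df1.merge(df2, on="Names", how="right", indicator=True)

# Keep rows that matched nothing in df1, then drop the helper column
only_in_df2 = merged["_merge"] == "right_only"
df3 = merged.loc[only_in_df2, ["Names"]].reset_index(drop=True)
print(df3)
```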

Input dataframes:

list1 = ['Mark','Sofi','Joh','Leo','Jason']
df1 = pd.DataFrame(list1, columns =['Names'])

list2 = ['Mark','Sofi','David','Matt','Jason']
df2 = pd.DataFrame(list2, columns =['Names'])
Answered By: Corralien

Here is another approach:

key_diff = set(df2.Names).difference(df1.Names)
where_diff = df2.Names.isin(key_diff)
df3 = df2[where_diff]

Answered By: Uchiha012

Using Set Operations

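# Recent pandas versions reject a raw set ("type is unordered"), so convert it to a sorted list first
df3 = pd.DataFrame(sorted(set(list2) - set(list1)), columns=["Names"])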
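A set does not remember the order in which names appeared in list2. If that order matters, one possible sketch is a list comprehension with a set used only for lookups; the names seen and missing are hypothetical helpers introduced here.

```python
import pandas as pd

list1 = ["Mark", "Sofi", "Joh", "Leo", "Jason"]
list2 = ["Mark", "Sofi", "David", "Matt", "Jason"]

# A set gives fast membership tests against list1
seen = set(list1)

# dict.fromkeys drops duplicates in list2 while keeping first-seen order
missing = [name for name in dict.fromkeys(list2) if name not in seen]

df3 = pd.DataFrame(missing, columns=["Names"])
print(df3)
```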
Answered By: Just James