Adding a new column to a merged DataFrame based on the pre-merged DataFrames

Question:

I have two DataFrames, df1 and df2. In my code I use the pd.concat method to find the differences between them.

import pandas as pd

# Read the first and second sheets of the spreadsheet.
df1 = pd.read_excel(latest_file, 0)
df2 = pd.read_excel(latest_file, 1)

new_dataframe = pd.concat([df1,df2]).drop_duplicates(keep=False)

This works perfectly, but I want to know which rows come from df1 and which come from df2. To show this, I want to add a column to new_dataframe that says "Removed" if the row is from df1 and "Added" if it is from df2. I can't seem to find any documentation on how to do this. Thanks in advance for any help.

Edit: My current code removes all rows that are identical in both DataFrames. The solution still has to remove the common rows.

Asked By: ijoubert21

||

Answers:

Consider using pd.merge with indicator=True instead. This adds a column named _merge that records whether each row came from the left DataFrame only, the right DataFrame only, or both. You can then filter out the rows marked both and relabel the remaining values as Removed and Added.
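To make the role of the indicator column concrete, here is a minimal sketch of what the outer merge looks like before any filtering or relabelling. The sample frames mirror the ones used in this answer; the names merged and n_both are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'col1': [3, 4, 5, 6, 7]})

# Outer merge on the shared column; indicator=True adds a `_merge` column
# whose values are 'left_only', 'right_only' or 'both'.
merged = pd.merge(df1, df2, how='outer', indicator=True)

# Rows present in both frames are the ones the question wants removed.
n_both = (merged['_merge'] == 'both').sum()

print(merged)
```

Rows tagged left_only exist only in df1, and rows tagged right_only exist only in df2.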

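import pandas as pd

df1 = pd.DataFrame({'col1': [1,2,3,4,5]})
df2 = pd.DataFrame({'col1': [3,4,5,6,7]})

# Map the merge indicator values to the labels asked for in the question.
m = {'left_only': 'Removed', 'right_only': 'Added'}

# The parentheses let the method chain span several lines.
new_dataframe = (
    pd.merge(df1, df2, how='outer', indicator=True)
      .query('_merge != "both"')
      .replace({'_merge': m})
)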

Output:

   col1   _merge
0     1  Removed
1     2  Removed
5     6    Added
6     7    Added
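The _merge column is categorical, and some recent pandas versions emit a deprecation warning when replace is applied to categorical data. A possible variant, sketched here under that assumption, casts the indicator to plain strings and maps it to the labels instead. The column name Status and the names labels and diff are hypothetical choices for illustration, not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'col1': [3, 4, 5, 6, 7]})

labels = {'left_only': 'Removed', 'right_only': 'Added'}

diff = (
    pd.merge(df1, df2, how='outer', indicator=True)
      # Keep only rows that appear in exactly one of the two frames.
      .query('_merge != "both"')
      # Cast the categorical indicator to strings, then map to the labels.
      .assign(Status=lambda d: d['_merge'].astype(str).map(labels))
      .drop(columns='_merge')
      .reset_index(drop=True)
)

print(diff)
```

This leaves the data columns untouched and stores the label in its own column, which may be easier to work with downstream than a renamed _merge.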
Answered By: Stu Sztukowski
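If staying close to the concat approach from the question is preferred, one option is to tag each frame before concatenating and then compare only the original columns when dropping duplicates. This is a sketch that assumes both sheets share the same columns. The column name Change and the names tagged and diff are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'col1': [3, 4, 5, 6, 7]})

# Tag each frame with its origin before concatenating.
tagged = pd.concat(
    [df1.assign(Change='Removed'), df2.assign(Change='Added')],
    ignore_index=True,
)

# Compare on the original columns only; otherwise the tag itself would
# make every row look unique and nothing would be dropped.
diff = tagged.drop_duplicates(subset=list(df1.columns), keep=False)

print(diff)
```

As with the original one-liner, a row that is duplicated within a single sheet would also be dropped by keep=False.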