Adding new column to merged DataFrame based on pre-merged DataFrames
Question:
I have two DataFrames, df1 and df2. In my code I use pd.concat to find the differences between them.
# Read the first and second sheets of the spreadsheet.
df1 = pd.read_excel(latest_file, 0)
df2 = pd.read_excel(latest_file, 1)
new_dataframe = pd.concat([df1, df2]).drop_duplicates(keep=False)
This works perfectly, but I want to know which rows come from df1 and which come from df2. To show this, I want to add a column to new_dataframe that says "Removed" if the row is from df1 and "Added" if it is from df2. I can't seem to find any documentation on how to do this. Thanks in advance for any help.
Edit: My current code removes all rows that are identical in both DataFrames. The solution still has to remove those common rows.
Answers:
Consider using pd.merge with indicator=True instead. This creates a new column named _merge that indicates which DataFrame each row came from (left_only, right_only, or both). You can then relabel those values as Removed and Added:
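import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'col1': [3, 4, 5, 6, 7]})
# Map the indicator values to the desired labels.
m = {'left_only': 'Removed', 'right_only': 'Added'}
# The parentheses let the method chain span several lines.
new_dataframe = (
    pd.merge(df1, df2, how='outer', indicator=True)
    .query('_merge != "both"')
    .replace({'_merge': m})
)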
Output:
   col1   _merge
0     1  Removed
1     2  Removed
5     6    Added
6     7    Added
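If you would rather keep the original concat approach, one possible variation is to tag each row with its origin before concatenating, and then drop duplicates on the data columns only. The sketch below assumes both DataFrames have the same columns; the names Status, compare_cols and tagged are made up for illustration and are not part of the question or of pandas.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({"col1": [3, 4, 5, 6, 7]})

# Columns used to decide whether two rows are "the same"
# (assumption: df1 and df2 share the same columns).
compare_cols = list(df1.columns)

# Tag each row with its origin before concatenating.
tagged = pd.concat(
    [df1.assign(Status="Removed"), df2.assign(Status="Added")],
    ignore_index=True,
)

# keep=False drops every row whose data columns occur more than once,
# so rows common to both DataFrames disappear and the tag is ignored
# during the comparison.
new_dataframe = tagged.drop_duplicates(subset=compare_cols, keep=False)
print(new_dataframe)
```

As with the original concat code, a row that is duplicated within a single sheet would also be dropped, so this only behaves as a diff when each sheet has no repeated rows.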