Comparing 2 revisions of excel files in python pandas

Question:

I am very new to pandas, so this might be a silly question to some of you.
I want to compare two Excel files and output the changed and the new entries.

old.csv

Product  Price  Description
1        1.25   Product 1
2        2.25   Product 2
3        3.25   Product 3

new.csv

Product  Price  Description
1        1.25   Product 1  # Product 2 not in list
3        3.50   Product 3  # Price update
4        4.25   Product 4  # New entry

TRIED

import pandas as pd
import numpy as np
import requests

url = '<SomeUrl>/<PriceList>.xls'

resp = requests.get(url)
df = pd.DataFrame(pd.read_excel(resp.content))
df.to_csv('new.csv')
old = pd.read_csv('old.csv')
new = pd.read_csv('new.csv')
changes = new.loc[new['Price'] != old['Price']] 
changes_csv = changes[['Product', 'Price', 'Description']]
print(changes_csv)

EXPECTING

3        3.50   Product 3  # Price update
4        4.25   Product 4  # New entry

I get the correct results if the lengths match exactly. Otherwise I get:
ValueError: Can only compare identically-labeled Series objects

BONUS
It would be awesome if I could also produce output for discontinued products.

Asked By: vetcode


Answers:

You could create a master index of all products, rebuild the old and new dataframes on that master index, then use df.compare() to compare the two dataframes:

import pandas as pd

# Sample data standing in for the old and new price lists
df1 = pd.DataFrame([[1,1.25,'Product 1'],[2,2.25,'Product 2'],[3,3.25,'Product 3']], columns=['Product','Price','Description'])
df2 = pd.DataFrame([[1,1.25,'Product 1'],[3,3.5,'Product 2'],[4,4.25,'Product 3 Change']], columns=['Product','Price','Description'])

# Master list of every product that appears in either dataframe
df1product = df1[['Product']]
df2product = df2[['Product']]
dfproducts = df1product.merge(df2product, on='Product', how='outer')

# Re-align both dataframes to the master list; products missing from one side become NaN rows
df1 = dfproducts.merge(df1, how='left', on='Product')
df1.set_index(df1['Product'], inplace=True)

df2 = dfproducts.merge(df2, how='left', on='Product')
df2.set_index(df2['Product'], inplace=True)

# Both dataframes now share identical labels, so compare() can diff them
dfcompare = df1.compare(df2, align_axis=0)
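
As a side note on why the shared index matters: the error in the question comes from comparing two Series whose labels differ. The sketch below is only an illustration under assumptions (the data is retyped from the question, and names such as old_price and all_products are made up here). It reproduces the error, then aligns both price Series on the union of product numbers before comparing.

```python
import pandas as pd

# Data retyped from the question (Description omitted for brevity)
old = pd.DataFrame({'Product': [1, 2, 3], 'Price': [1.25, 2.25, 3.25]})
new = pd.DataFrame({'Product': [1, 3, 4], 'Price': [1.25, 3.50, 4.25]})

# Label each price by its product number instead of its row position
old_price = old.set_index('Product')['Price']
new_price = new.set_index('Product')['Price']

# Comparing Series with different labels raises ValueError
try:
    new_price != old_price
    raised = False
except ValueError:
    raised = True

# Align both Series on every product seen in either file;
# a product missing from one side becomes NaN there
all_products = old_price.index.union(new_price.index)
old_aligned = old_price.reindex(all_products)
new_aligned = new_price.reindex(all_products)

# NaN never equals anything, so new entries count as "different";
# notna() then drops products that no longer exist in the new file
differs = new_aligned.ne(old_aligned)
changed_or_new = new_aligned[differs & new_aligned.notna()]
print(changed_or_new)
```

With the question's data this should keep products 3 and 4, which are the price update and the new entry.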
Answered By: WCeconomics

I have solved the problem. Although @WCeconomics kindly took the time to type the code out, it did not give me the output I wanted, likely because I am a noob with pandas.

This is how I solved it, in case it is useful to the community.

import pandas as pd
import openpyxl  # to write Excel files
from openpyxl.utils.dataframe import dataframe_to_rows

old = pd.read_excel('old.xls')
new = pd.read_excel('new.xls')
# data for these is in the same format as in the question

# Outer merge on Product keeps rows that exist in only one of the files
merged = old.merge(new, on='Product', how='outer', suffixes=('_old', '_new'))
df = merged[['Product', 'Description_old', 'Price_old', 'Price_new']]
# Keep only price increases; rows missing from either file hold NaN and are dropped
changes = df.loc[(df['Price_new'] > df['Price_old'])].dropna(how='any', axis=0)

wb = openpyxl.Workbook()
ws = wb.active
for r in dataframe_to_rows(changes, index=False, header=True):
    ws.append(r)
wb.save('avp_changes.xlsx')  # openpyxl writes the .xlsx format
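
The filter above only reports price increases. For anyone who also wants new entries and the discontinued products from the BONUS part, here is a rough sketch of one possible variation, not the code I actually ran. It assumes the same column names as in the question and uses the indicator=True option of merge, which adds a _merge column saying where each row came from. The variable names price_changes, new_entries and discontinued are made up for the example.

```python
import pandas as pd

# Data retyped from the question
old = pd.DataFrame({'Product': [1, 2, 3],
                    'Price': [1.25, 2.25, 3.25],
                    'Description': ['Product 1', 'Product 2', 'Product 3']})
new = pd.DataFrame({'Product': [1, 3, 4],
                    'Price': [1.25, 3.50, 4.25],
                    'Description': ['Product 1', 'Product 3', 'Product 4']})

# indicator=True adds a '_merge' column with the values
# 'left_only', 'right_only' or 'both'
merged = old.merge(new, on='Product', how='outer',
                   suffixes=('_old', '_new'), indicator=True)

# In both files, but the price differs (increase or decrease)
in_both = merged['_merge'] == 'both'
price_changes = merged[in_both & (merged['Price_old'] != merged['Price_new'])]

# Only in the new file: new entries
new_entries = merged[merged['_merge'] == 'right_only']

# Only in the old file: discontinued products
discontinued = merged[merged['_merge'] == 'left_only']

print(price_changes[['Product', 'Price_old', 'Price_new']])
print(new_entries[['Product', 'Price_new', 'Description_new']])
print(discontinued[['Product', 'Price_old', 'Description_old']])
```

Each of the three dataframes can then be written out the same way as changes above.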
Answered By: vetcode