Pandas percentage change matrix
Question:
I have a data frame:
     product  cost
0  product a    56
1  product b    59
2  product c   104
I’d like to make a percentage change matrix like:
           product a  product b  product c
product a                 -5.08%    -46.15%
product b      5.36%               -43.30%
product c     85.71%     76.27%
There could be n number of products.

How do I do this using pandas?

How do I get the highest / lowest percentage change products? i.e. Highest: product a vs. product c. Lowest: product c vs. product a.
Thank you for your help.
Answers:
Use numpy broadcasting:
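import numpy as np
import pandas as pd

# convert columns to arrays
idx = df['product'].to_numpy()
cost = df['cost'].to_numpy()
# compute the percentage change using broadcasting:
# pct[i, j] = (cost[i] - cost[j]) / cost[j] * 100
pct = ((cost[:, None] - cost) / cost * 100).round(2)
# optional, set NaNs in the diagonal
# (done on the array itself, since out.values may be a read-only view)
np.fill_diagonal(pct, np.nan)
# convert to DataFrame
out = pd.DataFrame(pct, index=idx, columns=idx)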
print(out)
Output:
           product a  product b  product c
product a        NaN      -5.08     -46.15
product b       5.36        NaN     -43.27
product c      85.71      76.27        NaN
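The broadcasting answer above builds the matrix but does not cover the second part of the question (the single highest and lowest change). A minimal sketch of one way to do that on the NumPy array, assuming the same sample data as in the question; the names `pct`, `highest` and `lowest` are illustrative and not part of the original answer:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"product": ["product a", "product b", "product c"],
                   "cost": [56, 59, 104]})

names = df["product"].to_numpy()
cost = df["cost"].to_numpy(dtype=float)

# pct[i, j] = (cost[i] - cost[j]) / cost[j] * 100
pct = (cost[:, None] - cost) / cost * 100
np.fill_diagonal(pct, np.nan)  # ignore self-comparisons

# nanargmax / nanargmin return a flat index; unravel it into (row, column)
hi_row, hi_col = np.unravel_index(np.nanargmax(pct), pct.shape)
lo_row, lo_col = np.unravel_index(np.nanargmin(pct), pct.shape)

highest = (str(names[hi_row]), str(names[hi_col]), round(float(pct[hi_row, hi_col]), 2))
lowest = (str(names[lo_row]), str(names[lo_col]), round(float(pct[lo_row, lo_col]), 2))
print("highest:", highest)
print("lowest:", lowest)
```

Each tuple reads as (row product, column product, percentage change of the row product relative to the column product).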
question 1
Here is a short way to do the math:
import pandas as pd
import numpy as np
df = pd.DataFrame([
["product a", 56],
["product b", 59],
["product c", 104]
], columns=["product", "cost"])
m = pd.DataFrame(
data=np.array(df.cost) * np.ones((len(df), len(df))),  # every row repeats the costs; works for any n
index=df["product"],
columns=df["product"],
)
m.index.name = None
m.columns.name = None
m = (m.T - m) / m # this is where the actual calculation happens
m
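The result is a matrix of fractional changes: row i, column j holds (cost_i - cost_j) / cost_j, so multiply by 100 to get percentages.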
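The question asks for cells formatted like "5.36%" with an empty diagonal. A possible sketch of that formatting step, assuming a matrix of fractional changes like `m`; it is rebuilt here with broadcasting so the example runs on its own, and `pretty` is an illustrative name:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"product": ["product a", "product b", "product c"],
                   "cost": [56, 59, 104]})

cost = df["cost"].to_numpy(dtype=float)
labels = df["product"].to_list()

# Fractional changes: m[row, col] = (cost[row] - cost[col]) / cost[col]
m = pd.DataFrame((cost[:, None] - cost) / cost, index=labels, columns=labels)

# Format each cell as a percentage string with two decimals
pretty = m.apply(lambda col: col.map(lambda x: f"{x:.2%}"))
# Blank out the diagonal, as in the table the question asks for
pretty = pretty.mask(np.eye(len(m), dtype=bool), "")
print(pretty)
```

The formatted frame holds strings, so it is only suitable for display; keep `m` for any further arithmetic.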
question 2
# products with largest change (the diagonal is set to -inf so that a product is never matched with itself)
(m + np.diag(np.full(len(df), -np.inf))).idxmax(axis=0)
# products with smallest change
(m + np.diag(np.full(len(df),np.inf))).idxmin(axis=0)
edit
OP asks for the single highest / lowest value in matrix m
# index of largest value
(m + np.diag(np.full(len(df), -np.inf))).stack().idxmax()
# index of smallest value
(m + np.diag(np.full(len(df), +np.inf))).stack().idxmin()
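As an alternative to shifting the diagonal with infinities, the diagonal can be masked out and all remaining pairs ranked at once. This is only a sketch, not the answer's own method; `m` is rebuilt here so the example is self-contained, and `pairs`, `lowest_pair` and `highest_pair` are illustrative names:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"product": ["product a", "product b", "product c"],
                   "cost": [56, 59, 104]})

cost = df["cost"].to_numpy(dtype=float)
labels = df["product"].to_list()

# m[row, col] = (cost[row] - cost[col]) / cost[col]
m = pd.DataFrame((cost[:, None] - cost) / cost, index=labels, columns=labels)

# Keep only off-diagonal cells, flatten to (row, col) pairs, sort ascending
off_diagonal = ~np.eye(len(m), dtype=bool)
pairs = m.where(off_diagonal).stack().dropna().sort_values()

lowest_pair = pairs.index[0]    # (row product, column product) of the smallest change
highest_pair = pairs.index[-1]  # (row product, column product) of the largest change
print(pairs)
```

The explicit `dropna()` is there because whether `stack()` drops missing values by default depends on the pandas version.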
Another possible solution uses SciPy's spatial distance functions with a custom function (perc_change) to calculate the percentages. Matrices mat1 and mat2 compute, respectively, the values below and above the main diagonal of the final dataframe.
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform
def perc_change(u, v):
    return (v - u) / u * 100
mat1 = squareform(pdist(df[['cost']].values, lambda u, v: perc_change(v[0], u[0])))
mat2 = squareform(pdist(df[['cost']].values, lambda u, v: perc_change(u[0], v[0])))
mat = np.tril(mat1) + np.triu(mat2)
pd.DataFrame(mat, columns=df['product'].to_list(), index=df['product'].to_list())
Output:
            product a   product b  product c
product a    0.000000    5.357143  85.714286
product b   -5.084746    0.000000  76.271186
product c  -46.153846  -43.269231   0.000000
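The reason this approach needs two `pdist` calls is that `pdist` evaluates the function only once per pair (first row against second, and so on, never the reverse), and `squareform` mirrors those values into a symmetric matrix. Percentage change is not symmetric, so each triangle needs its own argument order. A small sketch that records the calls to make this visible; the `calls` list and `recorded_change` wrapper are illustrative additions:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Costs from the question, as a column of observations
cost = np.array([[56.0], [59.0], [104.0]])

calls = []  # records the (u, v) pairs that pdist actually evaluates

def recorded_change(u, v):
    calls.append((float(u[0]), float(v[0])))
    return (v[0] - u[0]) / u[0] * 100

condensed = pdist(cost, recorded_change)  # one value per unordered pair
full = squareform(condensed)              # mirrored into a symmetric matrix

print(calls)
print(full)
```

With three products there are only three evaluations, and `full` carries the same value above and below the diagonal, which is why the answer combines `np.tril` of one matrix with `np.triu` of another.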
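Here is a way using dot and np.diag:
df = df.set_index('product')
# outer product: df2[i, j] = cost[i] * cost[j]
df2 = df.dot(df.T)
# divide the diagonal (cost[i] ** 2) by each row, then subtract 1:
# cost[i] / cost[j] - 1 = (cost[i] - cost[j]) / cost[j]
df2 = df2.rdiv(np.diag(df2.to_numpy()), axis=0).sub(1)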
Output:
product    product a  product b  product c
product
product a   0.000000  -0.050847  -0.461538
product b   0.053571   0.000000  -0.432692
product c   0.857143   0.762712   0.000000
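To see why the dot/diag approach yields the same numbers as plain broadcasting, note that the outer product holds cost[i] * cost[j], its diagonal holds cost[i] ** 2, and their ratio minus 1 is (cost[i] - cost[j]) / cost[j]. A short check of that equivalence, assuming the sample data from the question; `via_dot` and `direct` are illustrative names:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question, indexed by product
df = pd.DataFrame({"product": ["product a", "product b", "product c"],
                   "cost": [56, 59, 104]}).set_index("product")
cost = df["cost"].to_numpy(dtype=float)

# outer[i, j] = cost[i] * cost[j]; its diagonal is cost[i] ** 2
outer = df.dot(df.T)
via_dot = outer.rdiv(np.diag(outer.to_numpy()), axis=0).sub(1)

# The same quantity written directly with broadcasting
direct = (cost[:, None] - cost) / cost

print(np.allclose(via_dot.to_numpy(), direct))
```

Both routes give fractions rather than percentages; multiply by 100 if the percentage form is wanted.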