Pandas percentage change matrix
Question:
I have a data frame:
| | product | cost |
|---|---|---|
| 0 | product a | 56 |
| 1 | product b | 59 |
| 2 | product c | 104 |
I’d like to make a percentage change matrix like:
| | product a | product b | product c |
|---|---|---|---|
| product a | | -5.08% | -46.15% |
| product b | 5.36% | | -43.27% |
| product c | 85.71% | 76.27% | |
There could be any number (n) of products.
- How do I do this using pandas?
- How do I get the products with the highest / lowest percentage change? i.e. Highest: product a vs. product c. Lowest: product c vs. product a.
Thank you for your help.
Answers:
Use numpy broadcasting:
import numpy as np
import pandas as pd

# convert columns to arrays
idx = df['product'].to_numpy()
cost = df['cost'].to_numpy()
# compute the percentage change using broadcasting:
# entry [i, j] is (cost[i] - cost[j]) / cost[j] * 100
pct = ((cost[:, None] - cost) / cost * 100).round(2)
# optional, set NaNs in the diagonal (done on the array, which is always writable)
np.fill_diagonal(pct, np.nan)
# convert to DataFrame
out = pd.DataFrame(pct, index=idx, columns=idx)
print(out)
Output:
           product a  product b  product c
product a        NaN      -5.08     -46.15
product b       5.36        NaN     -43.27
product c      85.71      76.27        NaN
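To see why the broadcasting expression yields the right matrix, it may help to look at the intermediate shapes. The snippet below is only a sketch using the three costs from the question; the names col, diff and by_hand are introduced here for illustration and are not part of the answer above.

```python
import numpy as np

cost = np.array([56, 59, 104])

col = cost[:, None]       # shape (3, 1): one cost per row
diff = col - cost         # shape (3, 3): diff[i, j] = cost[i] - cost[j]
pct = diff / cost * 100   # divides every entry in column j by cost[j]

# spot-check one entry by hand: product a (56) relative to product b (59)
by_hand = (56 - 59) / 59 * 100

print(col.shape, diff.shape)
print(round(pct[0, 1], 2), round(by_hand, 2))
```

Because the (3, 1) column is stretched across columns and the (3,) vector is stretched across rows, no explicit loop is needed and the same code works for any number of products.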
Question 1
Here is a short way to do the math:
import pandas as pd
import numpy as np
df = pd.DataFrame([
    ["product a", 56],
    ["product b", 59],
    ["product c", 104]
], columns=["product", "cost"])
n = len(df)  # works for any number of products
m = pd.DataFrame(
    data=np.array(df.cost) * np.ones((n, n)),  # every row holds the full cost vector
    index=df["product"],
    columns=df["product"],
)
m.index.name = None
m.columns.name = None
m = (m.T - m) / m  # this is where the actual calculation happens
m
The result is the matrix of fractional changes (multiply by 100 for percentages), with zeros on the diagonal.
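If the goal is the exact layout from the question (strings such as -5.08% and an empty diagonal), one option is to format the numeric matrix afterwards. This is a minimal sketch; format_pct is a hypothetical helper name, and m is rebuilt here with broadcasting so that the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["product a", "product b", "product c"],
    "cost": [56, 59, 104],
})
cost = df["cost"].to_numpy()
labels = df["product"].to_list()

# fractional change matrix: m[i, j] = (cost[i] - cost[j]) / cost[j]
m = pd.DataFrame((cost[:, None] - cost) / cost, index=labels, columns=labels)


def format_pct(frame):
    """Return a DataFrame of strings like '-5.08%' with a blank diagonal."""
    n_rows, n_cols = frame.shape
    cells = [
        ["" if i == j else f"{frame.iat[i, j]:.2%}" for j in range(n_cols)]
        for i in range(n_rows)
    ]
    return pd.DataFrame(cells, index=frame.index, columns=frame.columns)


pretty = format_pct(m)
print(pretty)
```

Keep the numeric matrix around for any further calculation; the formatted frame holds strings and is only meant for display.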
Question 2
# for each column, the product with the largest change
# (-inf on the diagonal keeps a product from being compared with itself)
(m + np.diag(np.full(len(df), -np.inf))).idxmax(axis=0)
# for each column, the product with the smallest change
(m + np.diag(np.full(len(df), np.inf))).idxmin(axis=0)
Edit
The OP asks for the single highest / lowest value in matrix m:
# (row, column) labels of the largest value
(m + np.diag(np.full(len(df), -np.inf))).stack().idxmax()
# (row, column) labels of the smallest value
(m + np.diag(np.full(len(df), +np.inf))).stack().idxmin()
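The same lookup can also return the value along with the pair of labels. The sketch below swaps the infinity trick for a NaN diagonal plus np.nanargmax / np.nanargmin; extreme_pair is a hypothetical helper name, and it assumes at least two products (with a single product every entry would be NaN and the NumPy call would raise).

```python
import numpy as np
import pandas as pd

labels = ["product a", "product b", "product c"]
cost = np.array([56, 59, 104])
m = pd.DataFrame((cost[:, None] - cost) / cost, index=labels, columns=labels)


def extreme_pair(frame, largest=True):
    """Return (row label, column label, value) of the extreme off-diagonal entry."""
    values = frame.to_numpy(dtype=float, copy=True)
    np.fill_diagonal(values, np.nan)  # ignore self-comparisons
    flat = np.nanargmax(values) if largest else np.nanargmin(values)
    i, j = np.unravel_index(flat, values.shape)
    return frame.index[i], frame.columns[j], values[i, j]


highest = extreme_pair(m, largest=True)
lowest = extreme_pair(m, largest=False)
print(highest)
print(lowest)
```

With the costs from the question, the highest entry sits in row product c, column product a, and the lowest in row product a, column product c.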
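Another possible solution uses SciPy's spatial distance functions with a custom function (perc_change) to calculate the percentages. Matrices mat1 and mat2 supply, respectively, the values below and above the main diagonal of the final dataframe. Note that the result is the transpose of the layout in the question: row i, column j holds the change from product i to product j.
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

def perc_change(u, v):
    return (v - u) / u * 100

# pdist calls the function once per pair (i < j); squareform mirrors the values into a square matrix
mat1 = squareform(pdist(df[['cost']].values, lambda u, v: perc_change(v[0], u[0])))
mat2 = squareform(pdist(df[['cost']].values, lambda u, v: perc_change(u[0], v[0])))
# keep the lower triangle of mat1 and the upper triangle of mat2
mat = np.tril(mat1) + np.triu(mat2)
pd.DataFrame(mat, columns=df['product'].to_list(), index=df['product'].to_list())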
Output:
           product a  product b  product c
product a   0.000000   5.357143  85.714286
product b  -5.084746   0.000000  76.271186
product c -46.153846 -43.269231   0.000000
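To make the triangle logic concrete, here is a NumPy-only sketch that fills the two triangles directly. It does not call SciPy; it only mimics the pair order (i < j) that pdist visits, and the variable names are assumptions made for this illustration.

```python
import numpy as np

cost = np.array([56.0, 59.0, 104.0])
n = len(cost)

# index pairs with i < j, in the order (0, 1), (0, 2), (1, 2)
i, j = np.triu_indices(n, k=1)

mat = np.zeros((n, n))
# above the diagonal: perc_change(cost[i], cost[j])
mat[i, j] = (cost[j] - cost[i]) / cost[i] * 100
# below the diagonal: perc_change(cost[j], cost[i])
mat[j, i] = (cost[i] - cost[j]) / cost[j] * 100

# the layout asked for in the question, built with broadcasting
question_layout = (cost[:, None] - cost) / cost * 100

print(np.allclose(mat, question_layout.T))
```

Transposing mat (or the resulting DataFrame, with .T) gives the orientation used in the question.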
Here is a way using dot and np.diag (the result is a fraction; multiply by 100 for percentages):
df = df.set_index('product')
df2 = df.dot(df.T)
df2 = df2.rdiv(np.diag(df2.to_numpy()),axis=0).sub(1)
Output:
product    product a  product b  product c
product
product a   0.000000  -0.050847  -0.461538
product b   0.053571   0.000000  -0.432692
product c   0.857143   0.762712   0.000000
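The algebra behind this one: the dot product gives cost[i] * cost[j], the diagonal holds cost[i] squared, and dividing the second by the first leaves cost[i] / cost[j]. The sketch below reproduces those steps in plain NumPy and compares them with taking the ratio directly; np.divide.outer is used as an assumed shortcut, not something from the answer above.

```python
import numpy as np

cost = np.array([56, 59, 104])

outer = np.outer(cost, cost)             # outer[i, j] = cost[i] * cost[j]
squares = np.diag(outer)                 # squares[i] = cost[i] ** 2
via_dot = squares[:, None] / outer - 1   # cost[i] / cost[j] - 1

# the same ratio without building the product first
direct = np.divide.outer(cost, cost) - 1

print(np.allclose(via_dot, direct))
```

For very large costs the intermediate products can overflow integer dtypes, so the direct ratio is the safer of the two forms.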