Calculate percentage change between values of column in Pandas dataframe
Question:
I have a dataframe with some price indices across 5 years, from 2017 to 2021. It looks like this:
Country
Industry
Year
Index
US
Agriculture
2017
83
US
Agriculture
2018
97.2
US
Agriculture
2019
100
US
Agriculture
2020
112
US
Agriculture
2021
108
Japan
Mining
2017
88
Japan
Mining
2018
93
Japan
Mining
2019
100
Japan
Mining
2020
104
Japan
Mining
2021
112
My base year is 2019, hence the Index for every row tagged with 2019 is 100. Everything else moves up or down. I want to generate another column called Percentage_Change
showing the year on year change starting from 2019 as the base year.
I tried using the pd.series.pct_change
function, however, that calculates the year on year percentage change starting with 2017 and it generates an NaN
value for all rows where the year is 2017, instead of 2019 which should be the base year.
I want the output to look like this:
Country
Industry
Year
Index
Percentage_change
Japan
Mining
2017
88
-5.37%
Japan
Mining
2018
93
-7%
Japan
Mining
2019
100
0
Japan
Mining
2020
104
4%
Japan
Mining
2021
112
7.69%
The percentage_change
for Japan between 2021 and 2020 is (112-104)/104 = 7.69%, the difference between 2020 and 2019 is (104-100)/100 = 4%, the difference between 2018 and 2019 is (93-100)/100 = -7%, the difference between 2017 and 2018 is (88-93)/93 = -5.37%
Is there any other way of calculating % change in pandas?
Answers:
pct_change
is computing a change relative to the previous value (which is why 2017 is NaN), and this doesn’t seem to be what you want. If you want to compute a percentage change relative to 2019, as 2019 is already normalized to 100, simply subtract 100:
df['Percentage_Change'] = df['Index'].sub(100)
output:
Country Industry Year Index Percentage_Change
0 US Agriculture 2017 83.0 -17.0
1 US Agriculture 2018 97.2 -2.8
2 US Agriculture 2019 100.0 0.0
3 US Agriculture 2020 112.0 12.0
4 US Agriculture 2021 108.0 8.0
5 Japan Mining 2017 88.0 -12.0
6 Japan Mining 2018 93.0 -7.0
7 Japan Mining 2019 100.0 0.0
8 Japan Mining 2020 104.0 4.0
9 Japan Mining 2021 112.0 12.0
bidirectional pct_change
If you want a bidirectional pct_change
"centered" on 100, you can use masks to compute the pct_change
both ways:
df['Percentage_Change'] = (df
.assign(ref=df['Year'].eq(2019))
.groupby(['Country', 'Industry'], group_keys=False)
.apply(lambda g: g['Index'].where(g['ref'].cummax()).pct_change()
.fillna(g['Index'][::-1].pct_change().mask(g['ref'].cummax(), 0))
)
)
output:
Country Industry Year Index Percentage_Change
0 US Agriculture 2017 83.0 -0.146091
1 US Agriculture 2018 97.2 -0.028000
2 US Agriculture 2019 100.0 0.000000
3 US Agriculture 2020 112.0 0.120000
4 US Agriculture 2021 108.0 -0.035714
5 Japan Mining 2017 88.0 -0.053763
6 Japan Mining 2018 93.0 -0.070000
7 Japan Mining 2019 100.0 0.000000
8 Japan Mining 2020 104.0 0.040000
9 Japan Mining 2021 112.0 0.076923
def function1(dd:pd.DataFrame):
ss1=dd.query("Year==2019").iloc[0,:]
return dd.apply(lambda ss:pd.concat([ss1,ss]).Index.pct_change().iloc[1],axis=1)
df1.assign(Percentage_change=df1.groupby('Country').apply(function1).droplevel(0))
Country Industry Year Index Percentage_change
0 US Agriculture 2017 83.0 -0.170
1 US Agriculture 2018 97.2 -0.028
2 US Agriculture 2019 100.0 0.000
3 US Agriculture 2020 112.0 0.120
4 US Agriculture 2021 108.0 0.080
5 Japan Mining 2017 88.0 -0.120
6 Japan Mining 2018 93.0 -0.070
7 Japan Mining 2019 100.0 0.000
8 Japan Mining 2020 104.0 0.040
9 Japan Mining 2021 112.0 0.120
I have a dataframe with some price indices across 5 years, from 2017 to 2021. It looks like this:
Country | Industry | Year | Index |
---|---|---|---|
US | Agriculture | 2017 | 83 |
US | Agriculture | 2018 | 97.2 |
US | Agriculture | 2019 | 100 |
US | Agriculture | 2020 | 112 |
US | Agriculture | 2021 | 108 |
Japan | Mining | 2017 | 88 |
Japan | Mining | 2018 | 93 |
Japan | Mining | 2019 | 100 |
Japan | Mining | 2020 | 104 |
Japan | Mining | 2021 | 112 |
My base year is 2019, hence the Index for every row tagged with 2019 is 100. Everything else moves up or down. I want to generate another column called Percentage_Change
showing the year on year change starting from 2019 as the base year.
I tried using the pd.series.pct_change
function, however, that calculates the year on year percentage change starting with 2017 and it generates an NaN
value for all rows where the year is 2017, instead of 2019 which should be the base year.
I want the output to look like this:
Country | Industry | Year | Index | Percentage_change |
---|---|---|---|---|
Japan | Mining | 2017 | 88 | -5.37% |
Japan | Mining | 2018 | 93 | -7% |
Japan | Mining | 2019 | 100 | 0 |
Japan | Mining | 2020 | 104 | 4% |
Japan | Mining | 2021 | 112 | 7.69% |
The percentage_change
for Japan between 2021 and 2020 is (112-104)/104 = 7.69%, the difference between 2020 and 2019 is (104-100)/100 = 4%, the difference between 2018 and 2019 is (93-100)/100 = -7%, the difference between 2017 and 2018 is (88-93)/93 = -5.37%
Is there any other way of calculating % change in pandas?
pct_change
is computing a change relative to the previous value (which is why 2017 is NaN), and this doesn’t seem to be what you want. If you want to compute a percentage change relative to 2019, as 2019 is already normalized to 100, simply subtract 100:
df['Percentage_Change'] = df['Index'].sub(100)
output:
Country Industry Year Index Percentage_Change
0 US Agriculture 2017 83.0 -17.0
1 US Agriculture 2018 97.2 -2.8
2 US Agriculture 2019 100.0 0.0
3 US Agriculture 2020 112.0 12.0
4 US Agriculture 2021 108.0 8.0
5 Japan Mining 2017 88.0 -12.0
6 Japan Mining 2018 93.0 -7.0
7 Japan Mining 2019 100.0 0.0
8 Japan Mining 2020 104.0 4.0
9 Japan Mining 2021 112.0 12.0
bidirectional pct_change
If you want a bidirectional pct_change
"centered" on 100, you can use masks to compute the pct_change
both ways:
df['Percentage_Change'] = (df
.assign(ref=df['Year'].eq(2019))
.groupby(['Country', 'Industry'], group_keys=False)
.apply(lambda g: g['Index'].where(g['ref'].cummax()).pct_change()
.fillna(g['Index'][::-1].pct_change().mask(g['ref'].cummax(), 0))
)
)
output:
Country Industry Year Index Percentage_Change
0 US Agriculture 2017 83.0 -0.146091
1 US Agriculture 2018 97.2 -0.028000
2 US Agriculture 2019 100.0 0.000000
3 US Agriculture 2020 112.0 0.120000
4 US Agriculture 2021 108.0 -0.035714
5 Japan Mining 2017 88.0 -0.053763
6 Japan Mining 2018 93.0 -0.070000
7 Japan Mining 2019 100.0 0.000000
8 Japan Mining 2020 104.0 0.040000
9 Japan Mining 2021 112.0 0.076923
def function1(dd:pd.DataFrame):
ss1=dd.query("Year==2019").iloc[0,:]
return dd.apply(lambda ss:pd.concat([ss1,ss]).Index.pct_change().iloc[1],axis=1)
df1.assign(Percentage_change=df1.groupby('Country').apply(function1).droplevel(0))
Country Industry Year Index Percentage_change
0 US Agriculture 2017 83.0 -0.170
1 US Agriculture 2018 97.2 -0.028
2 US Agriculture 2019 100.0 0.000
3 US Agriculture 2020 112.0 0.120
4 US Agriculture 2021 108.0 0.080
5 Japan Mining 2017 88.0 -0.120
6 Japan Mining 2018 93.0 -0.070
7 Japan Mining 2019 100.0 0.000
8 Japan Mining 2020 104.0 0.040
9 Japan Mining 2021 112.0 0.120