Year to Date Returns in Pandas DataFrame
Question:
I’d like to have a running year to date pct change column in my pandas dataframe:
Here is the dataframe:
dollar
Date
2015-01-01 97264.15
2015-02-01 102849.06
2015-03-01 101660.56
2015-04-01 102286.16
2015-05-01 103613.20
... ...
2020-12-01 197212.20
2021-01-01 196553.61
2021-02-01 202724.09
2021-03-01 210113.78
2021-04-01 220696.22
I can get a dataframe with year ending values and run pct_change on the dataframe:
df = df.groupby(pd.Grouper(level='Date', freq='A')).nth(-1)
df['Year'] = df['dollar'].pct_change(1)
But what I’d like is to have the monthly dataframe with a running YTD column.
Update: This gets me close...
dfGrouped = df.groupby(pd.Grouper(level = 'Date', freq='A'))
df['YTD'] = dfGrouped['dollar'].transform(lambda x: x/x.iloc[0]-1.0)
dollar YTD
Date
2020-12-01 197212.20 0.231018
2021-01-01 196553.61 0.000000
2021-02-01 202724.09 0.031393
2021-03-01 210113.78 0.068990
2021-04-01 220696.22 0.122830
But it is ‘off’ by 1 month. For example, the April 2021 YTD value is using the Jan 2021 value for the calculation instead of Dec 2020.
Thanks. Any help is greatly appreciated.
Nina
Answers:
If I understand correctly, you want the running percent change with respect to the last value of the previous year. It’s maybe not the most elegant, but you can explicitly build this last-value-of-previous-year series.
To start, build a series with the dates as index and the years as values:
>>> year = df.index.to_series().dt.year.rename('year')
>>> year
Date
2015-01-01 2015
2015-02-01 2015
2015-03-01 2015
2015-04-01 2015
2015-05-01 2015
2020-12-01 2020
2021-01-01 2021
2021-02-01 2021
2021-03-01 2021
2021-04-01 2021
Name: year, dtype: int64
Now we can pass this to the groupby, which as a result will just have the years as index, not the latest date of that year*:
>>> last_per_year = df['dollar'].groupby(year).last()
>>> last_per_year
year
2015 103613.20
2020 197212.20
2021 220696.22
Name: dollar, dtype: float64
So to get the previous year’s value you only have to shift(), and using year we can re-broadcast these values to the original shape:
>>> ref_dollar_yearly = year.map(last_per_year.shift())
>>> ref_dollar_yearly
Date
2015-01-01 NaN
2015-02-01 NaN
2015-03-01 NaN
2015-04-01 NaN
2015-05-01 NaN
2020-12-01 103613.2
2021-01-01 197212.2
2021-02-01 197212.2
2021-03-01 197212.2
2021-04-01 197212.2
Name: year, dtype: float64
Of course the first year (here 2015) has no reference value from the previous year. Some kind of join or merge might work instead of map (year.reset_index().merge(last_per_year.shift(), how='left', on='year').set_index('Date')['dollar']; it’s uglier, but maybe faster if there are many years?).
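To make the map-versus-merge comparison concrete, here is a minimal sketch on the ten rows printed in the question. The sample frame, the column name prev_year_last and the explicit reset_index() on the right-hand side are my own choices for illustration, not part of the answer above; both routes should give the same reference series.

```python
import pandas as pd

# Sample rows copied from the question (the full data set is not shown there).
df = pd.DataFrame(
    {"dollar": [97264.15, 102849.06, 101660.56, 102286.16, 103613.20,
                197212.20, 196553.61, 202724.09, 210113.78, 220696.22]},
    index=pd.DatetimeIndex(
        ["2015-01-01", "2015-02-01", "2015-03-01", "2015-04-01", "2015-05-01",
         "2020-12-01", "2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01"],
        name="Date"),
)

year = df.index.to_series().dt.year.rename("year")
last_per_year = df["dollar"].groupby(year).last()

# Route 1: map each row's year to the previous available year's last value.
via_map = year.map(last_per_year.shift())

# Route 2: the same lookup expressed as a left merge on the 'year' column.
# 'prev_year_last' is a hypothetical column name chosen for this sketch.
prev = last_per_year.shift().rename("prev_year_last").reset_index()
via_merge = (
    year.reset_index()                      # columns: Date, year
    .merge(prev, how="left", on="year")     # left merge keeps the row order
    .set_index("Date")["prev_year_last"]
)
```

Whether the merge is actually faster depends on the data; I have not measured it, so treat that as an open question.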
You already know how to do the rest:
>>> df['YTD'] = df['dollar'] / ref_dollar_yearly - 1
>>> df
dollar YTD
Date
2015-01-01 97264.15 NaN
2015-02-01 102849.06 NaN
2015-03-01 101660.56 NaN
2015-04-01 102286.16 NaN
2015-05-01 103613.20 NaN
2020-12-01 197212.20 0.903350
2021-01-01 196553.61 -0.003339
2021-02-01 202724.09 0.027949
2021-03-01 210113.78 0.065420
2021-04-01 220696.22 0.119080
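The steps above can be wrapped into one helper. This is only a sketch under the same assumptions as the answer (a DatetimeIndex named Date, a dollar column); the function name ytd_vs_prev_year_end and the ten-row sample frame are made up for illustration.

```python
import pandas as pd


def ytd_vs_prev_year_end(values: pd.Series) -> pd.Series:
    """Running change relative to the last value of the previous available year."""
    year = values.index.to_series().dt.year.rename("year")
    last_per_year = values.groupby(year).last()
    # shift() moves each year's last value down to the following row (year).
    ref = year.map(last_per_year.shift())
    return values / ref - 1


# Sample rows copied from the question.
df = pd.DataFrame(
    {"dollar": [97264.15, 102849.06, 101660.56, 102286.16, 103613.20,
                197212.20, 196553.61, 202724.09, 210113.78, 220696.22]},
    index=pd.DatetimeIndex(
        ["2015-01-01", "2015-02-01", "2015-03-01", "2015-04-01", "2015-05-01",
         "2020-12-01", "2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01"],
        name="Date"),
)

df["YTD"] = ytd_vs_prev_year_end(df["dollar"])
print(df)
```

With this sample the 2015 rows stay NaN, and April 2021 is measured against December 2020, which is what the question asked for.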
* Note that there is another subtlety here, in the case of missing years.
df['dollar'].groupby(year).last(), just like df['dollar'].groupby(year).nth(-1), does not return any value for missing years.
df['dollar'].groupby(pd.Grouper(level='Date', freq='A')).last() returns nan for missing years.
This is important since you want to divide by the previous year’s value: in the small example here, 2020’s results are divided by a value from 2015. To avoid this, you need to reindex last_per_year so that it covers every year before shift()ing:
>>> last_per_year.reindex(pd.RangeIndex(last_per_year.index.min(), last_per_year.index.max() + 1)).shift()
2015 NaN
2016 103613.2
2017 NaN
2018 NaN
2019 NaN
2020 NaN
2021 197212.2
Name: dollar, dtype: float64
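Putting the reindexed series back into the YTD calculation might look like the sketch below. It reuses the ten sample rows from the question; the names full_years and prev_year_last are hypothetical, and the int() casts are just a precaution of mine around the index min and max.

```python
import pandas as pd

# Sample rows copied from the question; 2016-2019 are missing on purpose.
df = pd.DataFrame(
    {"dollar": [97264.15, 102849.06, 101660.56, 102286.16, 103613.20,
                197212.20, 196553.61, 202724.09, 210113.78, 220696.22]},
    index=pd.DatetimeIndex(
        ["2015-01-01", "2015-02-01", "2015-03-01", "2015-04-01", "2015-05-01",
         "2020-12-01", "2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01"],
        name="Date"),
)

year = df.index.to_series().dt.year.rename("year")
last_per_year = df["dollar"].groupby(year).last()

# Make every calendar year explicit so shift() moves by exactly one year.
full_years = pd.RangeIndex(
    int(last_per_year.index.min()), int(last_per_year.index.max()) + 1
)
prev_year_last = last_per_year.reindex(full_years).shift()

# Years whose predecessor is missing now get NaN instead of a stale value.
df["YTD"] = df["dollar"] / year.map(prev_year_last) - 1
print(df)
```

The December 2020 row becomes NaN because there is no 2019 value to compare against, while the 2021 rows are unchanged.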
Had the same problem. Managed to fix it with a one-liner.
Essentially, df['dollar'].groupby(df.index.year).transform('first') provides, for each date of the original data frame, the first value available for that year in the dataset.
import pandas as pd
import yfinance as yf
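df = yf.download("SPY")
# Divide each close by the first close of its year; subtracting 1 turns the ratio into a return
df['ytd_return'] = df['Close'] / df['Close'].groupby(df.index.year).transform('first') - 1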
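For comparison, here is the same one-liner applied to the sample rows from the question instead of downloaded prices (a sketch; the column name ytd_from_first is made up). Note that the baseline is the first value inside each year, so this reproduces the convention of the question's update rather than measuring against the previous December.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame(
    {"dollar": [97264.15, 102849.06, 101660.56, 102286.16, 103613.20,
                197212.20, 196553.61, 202724.09, 210113.78, 220696.22]},
    index=pd.DatetimeIndex(
        ["2015-01-01", "2015-02-01", "2015-03-01", "2015-04-01", "2015-05-01",
         "2020-12-01", "2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01"],
        name="Date"),
)

# First available value of each calendar year, broadcast to every row.
first_of_year = df["dollar"].groupby(df.index.year).transform("first")
df["ytd_from_first"] = df["dollar"] / first_of_year - 1
print(df)
```

Here January 2021 comes out as 0 and April 2021 as roughly 0.1228, matching the numbers in the question's update. In this sample 2020 has a single row, so its value is 0 as well.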
Happy Coding!