Find the number of days since a max value
Question:
Given the following DataFrame:
+----+--------+------------+------+---------------------+
| id | player | match_date | stat | days_since_max_stat |
+----+--------+------------+------+---------------------+
| 1 | 1 | 2022-01-01 | 1500 | NaN |
| 2 | 1 | 2022-01-03 | 1600 | 2 |
| 3 | 1 | 2022-01-10 | 2100 | 7 |
| 4 | 1 | 2022-01-11 | 1800 | 1 |
| 5 | 1 | 2022-01-18 | 1700 | 8 |
| 6 | 2 | 2022-01-01 | 1600 | NaN |
| 7 | 2 | 2022-01-03 | 1800 | 2 |
| 8 | 2 | 2022-01-10 | 1600 | 7 |
| 9 | 2 | 2022-01-11 | 1900 | 8 |
| 10 | 2 | 2022-01-18 | 1500 | 7 |
+----+--------+------------+------+---------------------+
How would I calculate the days_since_max_stat column? The calculation of this column is exclusive of the stat in that row, and is done per player.
For example, the value for the row where id = 5 is 8 because the max stat so far was in the row where id = 3: days_since_max_stat = 2022-01-18 - 2022-01-10 = 8 days.
Here’s the base DataFrame:
import datetime as dt
import pandas as pd

dates = [
    dt.datetime(2022, 1, 1),
    dt.datetime(2022, 1, 3),
    dt.datetime(2022, 1, 10),
    dt.datetime(2022, 1, 11),
    dt.datetime(2022, 1, 18),
]
df = pd.DataFrame(
    {
        "id": range(1, 11),
        "player": [1 for i in range(5)] + [2 for i in range(5)],
        "match_date": dates + dates,
        "stat": (1500, 1600, 2100, 1800, 1700, 1600, 1800, 1600, 1900, 1500),
    }
)
Answers:
First, imagine you have only one player. Then you can use expanding to find the cumulative max/idxmax, and subtract:
def day_since_max(data):
    # index label of the running max, including the current row
    maxIdx = data['stat'].expanding().apply(pd.Series.idxmax).astype(int)
    # date of that running max, shifted one row so the current row is excluded
    date_at_max = data.loc[maxIdx, 'match_date'].shift()
    # .values avoids aligning on the (repeated) maxIdx labels
    return data['match_date'] - date_at_max.values
Now, we can use groupby().apply to run that function for each player:
df['days_since_max'] = df.groupby('player').apply(day_since_max).reset_index(level=0, drop=True)
Output:
id player match_date stat days_since_max
0 1 1 2022-01-01 1500 NaT
1 2 1 2022-01-03 1600 2 days
2 3 1 2022-01-10 2100 7 days
3 4 1 2022-01-11 1800 1 days
4 5 1 2022-01-18 1700 8 days
5 6 2 2022-01-01 1600 NaT
6 7 2 2022-01-03 1800 2 days
7 8 2 2022-01-10 1600 7 days
8 9 2 2022-01-11 1900 8 days
9 10 2 2022-01-18 1500 7 days
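To see what the intermediate steps produce, here is a minimal sketch on player 1's rows alone (same data as the question, with the id and player columns dropped for brevity):

```python
import datetime as dt
import pandas as pd

# player 1's rows from the question
d1 = pd.DataFrame({
    "match_date": [dt.datetime(2022, 1, d) for d in (1, 3, 10, 11, 18)],
    "stat": [1500, 1600, 2100, 1800, 1700],
})

# index label of the running max, including the current row
max_idx = d1["stat"].expanding().apply(pd.Series.idxmax).astype(int)
# -> 0, 1, 2, 2, 2: the max (2100) sits at label 2 from the third row on

# date of that running max, shifted so each row only sees PRIOR rows
date_at_max = d1.loc[max_idx, "match_date"].shift()

days = d1["match_date"] - date_at_max.values
# first row has no prior max (NaT); the rest are 2, 7, 1 and 8 days
```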
You can use a double groupby. The important part is to compute a new grouping key that collects together the rows that come after each running max. Once you have done that, it is a simple cumsum per group:
g = df.groupby(df['player'])
# date diff per group (days)
diff = g['match_date'].diff().dt.days
# group per lower than last max
group = df['stat'].ge(g['stat'].cummax()).shift().cumsum()
# days since last max
df['dsms'] = diff.groupby([df['player'], group]).cumsum()
Output:
id player match_date stat dsms
0 1 1 2022-01-01 1500 NaN
1 2 1 2022-01-03 1600 2.0
2 3 1 2022-01-10 2100 7.0
3 4 1 2022-01-11 1800 1.0
4 5 1 2022-01-18 1700 8.0
5 6 2 2022-01-01 1600 NaN
6 7 2 2022-01-03 1800 2.0
7 8 2 2022-01-10 1600 7.0
8 9 2 2022-01-11 1900 8.0
9 10 2 2022-01-18 1500 7.0
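To make the grouping trick concrete, here is a sketch of the intermediate group key on the question's data (rebuilding df so the snippet stands alone):

```python
import datetime as dt
import pandas as pd

dates = [dt.datetime(2022, 1, d) for d in (1, 3, 10, 11, 18)]
df = pd.DataFrame({
    "player": [1] * 5 + [2] * 5,
    "match_date": dates + dates,
    "stat": [1500, 1600, 2100, 1800, 1700, 1600, 1800, 1600, 1900, 1500],
})
g = df.groupby(df["player"])

# True where a row ties or beats its player's running max
is_new_max = df["stat"].ge(g["stat"].cummax())
# shift + cumsum: a fresh group label starts right AFTER each max,
# so each stretch "since the last max" shares one label
group = is_new_max.shift().cumsum()
# -> NaN, 1, 2, 3, 3, 3, 4, 5, 5, 6

diff = g["match_date"].diff().dt.days
dsms = diff.groupby([df["player"], group]).cumsum()
# dsms matches the expected column: NaN, 2, 7, 1, 8, NaN, 2, 7, 8, 7
```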
import numpy as np

def function1(dd: pd.DataFrame):
    if len(dd) > 1:
        # date of the current (last) row in the expanding window
        date1 = dd.match_date.iloc[-1]
        # date of the max stat among all PRIOR rows
        date2 = dd.iloc[:-1].sort_values('stat', ascending=False).match_date.iloc[0]
        return (date1 - date2).days
    return np.nan

df.assign(days_since_max_stat=pd.Series(df.groupby('player').expanding()).apply(function1))
Output:
id player match_date stat days_since_max_stat
0 1 1 2022-01-01 1500 NaN
1 2 1 2022-01-03 1600 2.0
2 3 1 2022-01-10 2100 7.0
3 4 1 2022-01-11 1800 1.0
4 5 1 2022-01-18 1700 8.0
5 6 2 2022-01-01 1600 NaN
6 7 2 2022-01-03 1800 2.0
7 8 2 2022-01-10 1600 7.0
8 9 2 2022-01-11 1900 8.0
9 10 2 2022-01-18 1500 7.0
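As a hedged alternative sketch (not one of the answers above): the same exclusive running-max idea can be written with no apply at all. cummax marks each row that sets (or ties) a new per-player max, where/ffill carries that row's date forward, and a per-player shift excludes the current row:

```python
import datetime as dt
import pandas as pd

dates = [dt.datetime(2022, 1, d) for d in (1, 3, 10, 11, 18)]
df = pd.DataFrame({
    "player": [1] * 5 + [2] * 5,
    "match_date": dates + dates,
    "stat": [1500, 1600, 2100, 1800, 1700, 1600, 1800, 1600, 1900, 1500],
})

# rows that set (or tie) a new running max for their player
is_run_max = df["stat"].eq(df.groupby("player")["stat"].cummax())

# date of the most recent max, carried forward, then shifted one row
# per player so the current row is excluded from its own calculation
prior_max_date = (
    df["match_date"].where(is_run_max)
    .groupby(df["player"]).ffill()
    .groupby(df["player"]).shift()
)
df["days_since_max_stat"] = (df["match_date"] - prior_max_date).dt.days
# -> NaN, 2, 7, 1, 8, NaN, 2, 7, 8, 7
```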