How to get the slope for every n days per group with respect to a conditioned row using Pandas?
Question:
I have the following dataframe (sample):
import pandas as pd
n = 3
data = [['A', '2022-09-01', False, 2, -3], ['A', '2022-09-02', False, 1, -2], ['A', '2022-09-03', False, 1, -1], ['A', '2022-09-04', True, 3, 0],
['A', '2022-09-05', False, 3, 1], ['A', '2022-09-06', False, 2, 2], ['A', '2022-09-07', False, 1, 3], ['A', '2022-09-07', False, 2, 3],
['A', '2022-09-08', False, 4, 4], ['A', '2022-09-09', False, 2, 5],
['B', '2022-09-01', False, 2, -4], ['B', '2022-09-02', False, 2, -3], ['B', '2022-09-03', False, 4, -2], ['B', '2022-09-04', False, 2, -1],
['B', '2022-09-05', True, 2, 0], ['B', '2022-09-06', False, 2, 1], ['B', '2022-09-07', False, 1, 2], ['B', '2022-09-08', False, 3, 3],
['B', '2022-09-09', False, 3, 4], ['B', '2022-09-10', False, 2, 5]]
df = pd.DataFrame(data = data, columns = ['group', 'date', 'indicator', 'value', 'diff_days'])
group date indicator value diff_days
0 A 2022-09-01 False 2 -3
1 A 2022-09-02 False 1 -2
2 A 2022-09-03 False 1 -1
3 A 2022-09-04 True 3 0
4 A 2022-09-05 False 3 1
5 A 2022-09-06 False 2 2
6 A 2022-09-07 False 1 3
7 A 2022-09-07 False 2 3
8 A 2022-09-08 False 4 4
9 A 2022-09-09 False 2 5
10 B 2022-09-01 False 2 -4
11 B 2022-09-02 False 2 -3
12 B 2022-09-03 False 4 -2
13 B 2022-09-04 False 2 -1
14 B 2022-09-05 True 2 0
15 B 2022-09-06 False 2 1
16 B 2022-09-07 False 1 2
17 B 2022-09-08 False 3 3
18 B 2022-09-09 False 3 4
19 B 2022-09-10 False 2 5
I would like to calculate the slope of n rows per group with respect to a conditioned row (indicator == True). So this means that it should return a column "slope" with the slopes before and after that conditioned row where this row should have a slope of 0. Besides that I would like to return a column called "id" which is actually a group id of the values representing a slope before (negative) or after (positive) that conditioned row. Here is the desired output:
data = [['A', '2022-09-01', False, 2, -3, -1, -0.5], ['A', '2022-09-02', False, 1, -2, -1, -0.5], ['A', '2022-09-03', False, 1, -1, -1, -0.5], ['A', '2022-09-04', True, 3, 0, 0, 0],
['A', '2022-09-05', False, 3, 1, 1, -1], ['A', '2022-09-06', False, 2, 2, 1, -1], ['A', '2022-09-07', False, 1, 3, 1, -1], ['A', '2022-09-07', False, 2, 3, 2, 0],
['A', '2022-09-08', False, 4, 4, 2, 0], ['A', '2022-09-09', False, 2, 5, 2, 0],
['B', '2022-09-01', False, 2, -4, -2], ['B', '2022-09-02', False, 2, -3, -1, 0], ['B', '2022-09-03', False, 4, -2, -1, 0], ['B', '2022-09-04', False, 2, -1, -1, 0],
['B', '2022-09-05', True, 2, 0, 0, 0], ['B', '2022-09-06', False, 2, 1, 1, 0.5], ['B', '2022-09-07', False, 1, 2, 1, 0.5], ['B', '2022-09-08', False, 3, 3, 1, 0.5],
['B', '2022-09-09', False, 3, 4, 2, -1], ['B', '2022-09-10', False, 2, 5, 2, -1]]
df_desired = pd.DataFrame(data = data, columns = ['group', 'date', 'indicator', 'value', 'diff_days', 'id', 'slope'])
group date indicator value diff_days id slope
0 A 2022-09-01 False 2 -3 -1 -0.5
1 A 2022-09-02 False 1 -2 -1 -0.5
2 A 2022-09-03 False 1 -1 -1 -0.5
3 A 2022-09-04 True 3 0 0 0.0
4 A 2022-09-05 False 3 1 1 -1.0
5 A 2022-09-06 False 2 2 1 -1.0
6 A 2022-09-07 False 1 3 1 -1.0
7 A 2022-09-07 False 2 3 2 0.0
8 A 2022-09-08 False 4 4 2 0.0
9 A 2022-09-09 False 2 5 2 0.0
10 B 2022-09-01 False 2 -4 -2 NaN
11 B 2022-09-02 False 2 -3 -1 0.0
12 B 2022-09-03 False 4 -2 -1 0.0
13 B 2022-09-04 False 2 -1 -1 0.0
14 B 2022-09-05 True 2 0 0 0.0
15 B 2022-09-06 False 2 1 1 0.5
16 B 2022-09-07 False 1 2 1 0.5
17 B 2022-09-08 False 3 3 1 0.5
18 B 2022-09-09 False 3 4 2 -1.0
19 B 2022-09-10 False 2 5 2 -1.0
Here are some explanations of group A:
- Rows 0,1 and 2 are the first values before (id=-1) the conditioned row (row 3) with slope(x=[-3,-2,-1],y=[2,1,1])=-0.5
- Rows 4,5 and 6 are the first values after (id=1) the conditioned row (row 3) with slope(x=[1,2,3],y=[3,2,1])=-1
- Rows 7,8 and 9 are the second values after (id=2) the conditioned row (row 3) with slope(x=[3,4,5],y=[2,4,2])=0
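The three slopes quoted above can be reproduced with an ordinary least-squares fit. This is only a quick check, assuming NumPy is available and that "slope" means the first-degree coefficient from numpy.polyfit (the question does not say which estimator is intended). The helper name ls_slope is made up for illustration:

```python
import numpy as np

def ls_slope(x, y):
    """Least-squares slope of y against x (hypothetical helper name)."""
    return float(np.polyfit(x, y, 1)[0])

# x = diff_days and y = value for the three windows of group A
windows = {
    -1: ([-3, -2, -1], [2, 1, 1]),
    1: ([1, 2, 3], [3, 2, 1]),
    2: ([3, 4, 5], [2, 4, 2]),
}
slopes = {k: ls_slope(x, y) for k, (x, y) in windows.items()}
for k, s in slopes.items():
    print(k, round(s, 6))
```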
So I was wondering if anyone knows whether it is possible to calculate the slopes for every n days with respect to a conditioned row using Pandas?
Answers:
This does the job but I don’t know if there is any fancier pandas way of doing things.
import math

groups = ['A', 'B']

# index label of the conditioned row (indicator == True) in each group
indexs = []
for i in groups:
    indexs.append(df.loc[(df['group'] == i) & (df['indicator'] == True)].index[0])

# signed window id of every row relative to its group's conditioned row
id3 = []
for i in groups:
    id2 = df.loc[df['group'] == i].index - indexs[groups.index(i)]
    for j in id2:
        if j < 0:
            id3.append(math.floor(j / n))
        else:
            id3.append(math.ceil(j / n))
df['id'] = id3

# slope() is not defined in this answer; it is assumed to be a helper
# that takes the y values first and the x values second
SlopeList = []
for i in groups:
    # unique id values in this group, in order of appearance
    idum = []
    for number in df['id'].loc[df['group'] == i]:
        if number not in idum:
            idum.append(number)
    for k in idum:
        grady = df['value'].loc[(df['group'] == i) & (df['id'] == k)]
        gradx = df['diff_days'].loc[(df['group'] == i) & (df['id'] == k)]
        Xm = slope(grady.tolist(), gradx.tolist())  # average slope
        # repeat the window's slope once per row in the window
        for m in range(len(gradx)):
            SlopeList.append(Xm)
df['slope'] = SlopeList
p.s. I haven’t done any unit testing on this code, so please check before using it for anything.
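The floor/ceil branching above can also be written without Python loops. This is a rough sketch, not a tested replacement: it assumes exactly one indicator == True row per group and rows already sorted by date within each group. The helper name window_ids and the small demo frame are made up for illustration:

```python
import numpy as np
import pandas as pd

def window_ids(df, n, group_col="group", flag_col="indicator"):
    """Signed window number of each row relative to its group's flagged row."""
    # 0-based position of every row inside its group
    pos = df.groupby(group_col).cumcount()
    # position of the flagged row inside each group, broadcast to every row
    anchor = pos.where(df[flag_col]).groupby(df[group_col]).transform("first")
    rel = pos - anchor
    # floor(j/n) for j < 0 and ceil(j/n) for j >= 0 both equal sign(j) * ceil(|j| / n)
    return (np.sign(rel) * np.ceil(rel.abs() / n)).astype(int)

demo = pd.DataFrame({
    "group": list("AAAAAAA") + list("BBBBBB"),
    "indicator": [False, False, True, False, False, False, False,
                  False, False, False, False, True, False],
})
demo["id"] = window_ids(demo, n=3)
print(demo["id"].tolist())
```

A group without any flagged row would produce NaN offsets and make the final integer cast fail, so such groups need to be filtered out or handled separately first.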
The main idea can be:
- create individual indexes for each group;
- align zeros with marked (conditioned) rows;
- replace indexes with their floor division by n;
- shift positive indexes one step forward and increment them by 1 to distinguish them from zero points.
After this we can use obtained indexes as an additional grouper to calculate slopes:
# create individual indexing for each group
id = df.groupby('group')['indicator'].cumcount()
# find positions of the condition rows in the group indexes
offset = id.where(df.indicator).groupby(df.group).first()
# shift the group indexes so that condition rows are indexed by zero
id = id.groupby(df.group).transform(lambda x: x - offset[x.name])
# transform the group indexes to their floor division by n,
# shift those which were positive by one position forward
# and increment their values by 1
n = 3
id = (id//n).mask(id>0, (id//n).shift().add(1))
# assign obtained id to a new column
df['id'] = id
# calculate slopes for each `group,id` pair
# (the Series is named so that `join` can add it as a 'slope' column)
grouped_slopes = df.groupby(['group','id']).apply(lambda g: slope(g.diff_days, g.value)).rename('slope')
# add slopes to the data
df = df.join(grouped_slopes, on=['group','id'])
As for the slope calculation, we can use either of the prepared formulas or write our own. In any case, we should also handle windows that contain only one item: return 0 for zero points (conditioned rows) and nan for single-element tails:
from typing import Literal

def slope(x, y, engine: Literal['numpy', 'scipy']='numpy'):
    from numpy import polyfit
    from scipy.stats import linregress
    match engine:
        case 'numpy':
            func = lambda x, y: polyfit(x, y, 1)[0]
        case 'scipy':
            func = lambda x, y: linregress(x, y).slope
        case _:
            raise ValueError(f'Wrong {engine=}')
    if len(x) > 1:
        return func(x, y)
    if len(x) == 1 and x.iloc[0] == 0:
        return 0
    return float('nan')
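As an alternative to calling slope once per window through groupby.apply, the least-squares slope can be computed with groupby transforms only, using slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2). This is a swapped-in technique, not the method from the answer above. It is a sketch that assumes the 'id' column already exists; the demo frame and the name window_slopes are made up for illustration:

```python
import pandas as pd

def window_slopes(df, keys=("group", "id"), x="diff_days", y="value"):
    """Per-row least-squares slope of y on x inside each (group, id) window."""
    by = [df[k] for k in keys]
    # centre x and y on their window means
    xc = df[x] - df.groupby(list(keys))[x].transform("mean")
    yc = df[y] - df.groupby(list(keys))[y].transform("mean")
    num = (xc * yc).groupby(by).transform("sum")
    den = (xc ** 2).groupby(by).transform("sum")
    out = num / den  # single-row windows give 0/0, i.e. NaN
    # the conditioned row (id == 0) is defined to have slope 0
    return out.mask(df["id"].eq(0), 0.0)

demo = pd.DataFrame({
    "group": ["A"] * 7 + ["B"],
    "id": [-1, -1, -1, 0, 1, 1, 1, -2],
    "diff_days": [-3, -2, -1, 0, 1, 2, 3, -4],
    "value": [2, 1, 1, 3, 3, 2, 1, 2],
})
demo["slope"] = window_slopes(demo)
print(demo)
```

Windows whose x values are all identical also produce NaN here, because the denominator is zero.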