How to get the slope for every n days per group with respect to a conditioned row using Pandas?

Question:

I have the following dataframe (sample):

import pandas as pd

n = 3

data = [['A', '2022-09-01', False, 2, -3], ['A', '2022-09-02', False, 1, -2], ['A', '2022-09-03', False, 1, -1], ['A', '2022-09-04', True, 3, 0], 
        ['A', '2022-09-05', False, 3, 1], ['A', '2022-09-06', False, 2, 2], ['A', '2022-09-07', False, 1, 3], ['A', '2022-09-07', False, 2, 3], 
        ['A', '2022-09-08', False, 4, 4], ['A', '2022-09-09', False, 2, 5],
        ['B', '2022-09-01', False, 2, -4], ['B', '2022-09-02', False, 2, -3], ['B', '2022-09-03', False, 4, -2], ['B', '2022-09-04', False, 2, -1], 
        ['B', '2022-09-05', True, 2, 0], ['B', '2022-09-06', False, 2, 1], ['B', '2022-09-07', False, 1, 2], ['B', '2022-09-08', False, 3, 3], 
        ['B', '2022-09-09', False, 3, 4], ['B', '2022-09-10', False, 2, 5]]
df = pd.DataFrame(data = data, columns = ['group', 'date', 'indicator', 'value', 'diff_days'])

   group        date  indicator  value  diff_days
0      A  2022-09-01      False      2         -3
1      A  2022-09-02      False      1         -2
2      A  2022-09-03      False      1         -1
3      A  2022-09-04       True      3          0
4      A  2022-09-05      False      3          1
5      A  2022-09-06      False      2          2
6      A  2022-09-07      False      1          3
7      A  2022-09-07      False      2          3
8      A  2022-09-08      False      4          4
9      A  2022-09-09      False      2          5
10     B  2022-09-01      False      2         -4
11     B  2022-09-02      False      2         -3
12     B  2022-09-03      False      4         -2
13     B  2022-09-04      False      2         -1
14     B  2022-09-05       True      2          0
15     B  2022-09-06      False      2          1
16     B  2022-09-07      False      1          2
17     B  2022-09-08      False      3          3
18     B  2022-09-09      False      3          4
19     B  2022-09-10      False      2          5

I would like to calculate the slope over every n rows per group, counted from a conditioned row (indicator == True). The result should be a column "slope" that holds the slope of each block of n rows before and after the conditioned row, with the conditioned row itself getting a slope of 0. I would also like a column "id" that numbers these blocks: negative for blocks before the conditioned row, positive for blocks after it. Here is the desired output:

data = [['A', '2022-09-01', False, 2, -3, -1, -0.5], ['A', '2022-09-02', False, 1, -2, -1, -0.5], ['A', '2022-09-03', False, 1, -1, -1, -0.5], ['A', '2022-09-04', True, 3, 0, 0, 0], 
        ['A', '2022-09-05', False, 3, 1, 1, -1], ['A', '2022-09-06', False, 2, 2, 1, -1], ['A', '2022-09-07', False, 1, 3, 1, -1], ['A', '2022-09-07', False, 2, 3, 2, 0], 
        ['A', '2022-09-08', False, 4, 4, 2, 0], ['A', '2022-09-09', False, 2, 5, 2, 0],
        ['B', '2022-09-01', False, 2, -4, -2, float('nan')], ['B', '2022-09-02', False, 2, -3, -1, 0], ['B', '2022-09-03', False, 4, -2, -1, 0], ['B', '2022-09-04', False, 2, -1, -1, 0], 
        ['B', '2022-09-05', True, 2, 0, 0, 0], ['B', '2022-09-06', False, 2, 1, 1, 0.5], ['B', '2022-09-07', False, 1, 2, 1, 0.5], ['B', '2022-09-08', False, 3, 3, 1, 0.5], 
        ['B', '2022-09-09', False, 3, 4, 2, -1], ['B', '2022-09-10', False, 2, 5, 2, -1]]
df_desired = pd.DataFrame(data = data, columns = ['group', 'date', 'indicator', 'value', 'diff_days', 'id', 'slope'])

   group        date  indicator  value  diff_days  id  slope
0      A  2022-09-01      False      2         -3  -1   -0.5
1      A  2022-09-02      False      1         -2  -1   -0.5
2      A  2022-09-03      False      1         -1  -1   -0.5
3      A  2022-09-04       True      3          0   0    0.0
4      A  2022-09-05      False      3          1   1   -1.0
5      A  2022-09-06      False      2          2   1   -1.0
6      A  2022-09-07      False      1          3   1   -1.0
7      A  2022-09-07      False      2          3   2    0.0
8      A  2022-09-08      False      4          4   2    0.0
9      A  2022-09-09      False      2          5   2    0.0
10     B  2022-09-01      False      2         -4  -2    NaN
11     B  2022-09-02      False      2         -3  -1    0.0
12     B  2022-09-03      False      4         -2  -1    0.0
13     B  2022-09-04      False      2         -1  -1    0.0
14     B  2022-09-05       True      2          0   0    0.0
15     B  2022-09-06      False      2          1   1    0.5
16     B  2022-09-07      False      1          2   1    0.5
17     B  2022-09-08      False      3          3   1    0.5
18     B  2022-09-09      False      3          4   2   -1.0
19     B  2022-09-10      False      2          5   2   -1.0

Here is how the values for group A are obtained:

  • Rows 0, 1 and 2 form the first block before (id=-1) the conditioned row (row 3), with slope(x=[-3,-2,-1],y=[2,1,1])=-0.5
  • Rows 4, 5 and 6 form the first block after (id=1) the conditioned row (row 3), with slope(x=[1,2,3],y=[3,2,1])=-1
  • Rows 7, 8 and 9 form the second block after (id=2) the conditioned row (row 3), with slope(x=[3,4,5],y=[2,4,2])=0
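
For reference, the three slopes listed for group A can be reproduced with an ordinary least-squares line fit. This is only a minimal sketch: it assumes "slope" means the degree-1 coefficient from numpy.polyfit, and the helper name ls_slope is hypothetical.

```python
import numpy as np

def ls_slope(x, y):
    """Least-squares slope of y against x (degree-1 polynomial fit)."""
    return np.polyfit(x, y, 1)[0]

# The three blocks of group A, with diff_days as x and value as y
before = ls_slope([-3, -2, -1], [2, 1, 1])   # rows 0-2, id = -1
after_1 = ls_slope([1, 2, 3], [3, 2, 1])     # rows 4-6, id = 1
after_2 = ls_slope([3, 4, 5], [2, 4, 2])     # rows 7-9, id = 2

print(round(before, 6), round(after_1, 6), round(after_2, 6))
```

Up to floating-point rounding, the three values should agree with the slopes listed above.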

So I was wondering whether it is possible to calculate the slopes for every n days with respect to a conditioned row using Pandas?

Asked By: Quinten

Answers:

This does the job but I don’t know if there is any fancier pandas way of doing things.

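import math

import numpy as np

# Assumption: the answer never defines `slope`; this least-squares helper
# (argument order y, x, matching the call below) stands in for it.
def slope(y, x):
    if len(x) > 1:
        return np.polyfit(x, y, 1)[0]
    # single row: 0 for the conditioned row, NaN for a lone leftover row
    return 0 if x[0] == 0 else float('nan')

groups = ['A', 'B']

# index of the conditioned row (indicator == True) in each group
indexs = []
for i in groups:
    indexs.append(df.loc[(df['group'] == i) & df['indicator']].index[0])

# block id: row offset from the conditioned row, divided by n
# and rounded away from zero
id3 = []
for i in groups:
    id2 = df.loc[df['group'] == i].index - indexs[groups.index(i)]
    for j in id2:
        if j < 0:
            id3.append(math.floor(j / n))
        else:
            id3.append(math.ceil(j / n))

df['id'] = id3

SlopeList = []
for i in groups:
    # unique ids of this group, in order of appearance
    idum = []
    for number in df.loc[df['group'] == i, 'id']:
        if number not in idum:
            idum.append(number)
    for k in idum:
        grady = df.loc[(df['group'] == i) & (df['id'] == k), 'value']
        gradx = df.loc[(df['group'] == i) & (df['id'] == k), 'diff_days']

        Xm = slope(grady.tolist(), gradx.tolist())  # slope of this block
        # repeat the slope once per row of the block
        SlopeList.extend([Xm] * len(gradx))

df['slope'] = SlopeList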

p.s. I haven’t done any unit testing on this code, so please check before using it for anything.
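
Since the code above is untested, a quick sanity check of the floor/ceil rule for the id column may help. The sketch below is not part of the answer's code: the helper name window_id is hypothetical, and the positions of the conditioned rows (3 in group A, 4 in group B) are taken from the sample data in the question.

```python
import math

n = 3  # block size, as in the question

def window_id(offset, n):
    """Map a row offset relative to the conditioned row to a block id."""
    # rows before the conditioned row: floor gives -1, -1, -1, -2, ...
    # rows at or after it: ceil gives 0, 1, 1, 1, 2, ...
    return math.floor(offset / n) if offset < 0 else math.ceil(offset / n)

# Group A: conditioned row at position 3 of 10 rows
ids_a = [window_id(pos - 3, n) for pos in range(10)]
# Group B: conditioned row at position 4 of 10 rows
ids_b = [window_id(pos - 4, n) for pos in range(10)]

# The id column of the desired output in the question
assert ids_a == [-1, -1, -1, 0, 1, 1, 1, 2, 2, 2]
assert ids_b == [-2, -1, -1, -1, 0, 1, 1, 1, 2, 2]
```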

Answered By: user6752871

The main idea can be:

  • create individual indexes for each group;
  • align zeros with marked (conditioned) rows;
  • replace indexes with their floor division by n;
  • shift positive indexes one step forward and increment them by 1 to distinguish them from zero points.

After this, we can use the obtained indexes as an additional grouper to calculate the slopes:

# create individual indexing for each group
id = df.groupby('group')['indicator'].cumcount()

# find positions of the condition rows in the group indexes
offset = id.where(df.indicator).groupby(df.group).first()

# shift the group indexes so that condition rows are indexed by zero
id = id.groupby(df.group).transform(lambda x: x - offset[x.name])

# transform the group indexes to their floor division by n,
# shift those which were positive by one position forward
# and increment their values by 1
n = 3
id = (id//n).mask(id>0, (id//n).shift().add(1))

# assign obtained id to a new column
df['id'] = id

# calculate slopes for each `group,id` pair
# (`slope` is the helper defined further down; the result is named
# because `DataFrame.join` rejects an unnamed Series)
grouped_slopes = (
    df.groupby(['group', 'id'])
      .apply(lambda g: slope(g.diff_days, g.value))
      .rename('slope')
)

# add slopes to the data
df = df.join(grouped_slopes, on=['group', 'id'])
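
To see why the positive indexes need the extra shift, it may help to run the floor-division step on a single group in isolation. The sketch below uses the row offsets of group B from the sample data (conditioned row at offset 0); the variable names are illustrative only.

```python
import pandas as pd

n = 3
# Offsets of group B's rows relative to its conditioned row
pos = pd.Series(range(-4, 6))

# Plain floor division already gives the right ids for offsets <= 0,
# but it would put offsets 1 and 2 into block 0 with the conditioned row
floored = pos // n

# For positive offsets, take the previous row's value and add 1,
# so offsets 1..3 become block 1, offsets 4..6 become block 2, and so on
shifted = floored.shift().add(1)
ids = floored.mask(pos > 0, shifted).astype(int)

print(ids.tolist())  # [-2, -1, -1, -1, 0, 1, 1, 1, 2, 2]
```

The printed list matches the id column of group B in the desired output.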

As for the slope calculation, we can use either of the ready-made functions or write our own. In any case, we should handle groups that contain only one item separately: return 0 for zero points (conditioned rows) and nan for single-element tails:

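from typing import Literal

def slope(x, y, engine: Literal['numpy', 'scipy']='numpy'):
    # `match` requires Python 3.10+; each backend is imported only when selected
    match engine:
        case 'numpy':
            from numpy import polyfit
            func = lambda x, y: polyfit(x, y, 1)[0]
        case 'scipy':
            from scipy.stats import linregress
            func = lambda x, y: linregress(x, y).slope
        case _:
            raise ValueError(f'Wrong {engine=}')

    # two or more points: fit a line and return its slope
    if len(x) > 1:
        return func(x, y)
    # a single point at x == 0 is the conditioned row
    if len(x) == 1 and x.iloc[0] == 0:
        return 0
    # any other single point is a leftover tail
    return float('nan')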
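
The three cases can be checked on small inputs. The sketch below uses a hypothetical NumPy-only variant named slope_or_default, so that it runs without SciPy and on Python versions older than 3.10; the sample values are taken from group B of the question.

```python
import numpy as np
import pandas as pd

def slope_or_default(x: pd.Series, y: pd.Series) -> float:
    """Hypothetical NumPy-only variant of the helper, with the same edge cases."""
    if len(x) > 1:
        return float(np.polyfit(x, y, 1)[0])
    if len(x) == 1 and x.iloc[0] == 0:
        return 0.0          # the conditioned row itself
    return float('nan')     # a single-row tail has no defined slope

# Group B, id = 1: a full block of three rows
full_window = slope_or_default(pd.Series([1, 2, 3]), pd.Series([2, 1, 3]))
# Group B, id = 0: the conditioned row
zero_point = slope_or_default(pd.Series([0]), pd.Series([2]))
# Group B, id = -2: a single leftover row before the first full block
lone_tail = slope_or_default(pd.Series([-4]), pd.Series([2]))

print(full_window, zero_point, lone_tail)
```

The first value should be close to 0.5, the second is 0.0, and the third is NaN, in line with the desired output for group B.
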
Answered By: Vitalizzare