Find weekly leaves aggregate for each partner before a specific date

Question:

I have a leave dataset of partners with leave start date and end date, duration of leaves and Last Working Date (LWD). I need to find the sum of leaves for each partner availed four weeks from LWD grouped in each week interval from LWD. Week1 may be considered 7 days from LWD, week2 as the next 7 days and so on.

EDIT: The aim is to find out the number of leaves each partner availed in each of the last four weeks till their departure from the company

Dataset example below, dates are in dd/mm/yyyy format

[image not available: dataset example]

I’m looking for an outcome such as:

[image not available: desired outcome]

I understand there would be a groupby followed by datetime.timedelta(days=7) to get to the dates from LWD, but I am confused as to how to arrive at the final outcome. Any help is appreciated. Please note that the weekly sums are not cumulative; each covers only the span of that specific week.

import pandas as pd
df = pd.DataFrame({'EID':[75161,75162,75162,75162,75162,75166,75166,75166,75169,75170],
                   'START_DATE':['30/08/21','01/10/21','18/06/21','12/11/21','14/06/21','22/04/21','22/07/21','23/08/21','24/08/21','25/10/21'],
                   'END_DATE':['30/08/21','01/10/21','18/06/21','12/11/21','14/06/21','23/04/21','23/07/21','23/08/21','26/08/21','25/10/21'],
                   'LWD':['30/08/21','13/11/21','13/11/21','13/11/21','13/11/21','13/10/21','13/10/21','13/10/21','13/10/21','13/11/21'],
                   'DURATION':[1,1,1,1,1,2,2,1,3,1]
                  })

df['START_DATE'] = pd.to_datetime(df['START_DATE'], infer_datetime_format=True)
df['END_DATE'] = pd.to_datetime(df['END_DATE'], infer_datetime_format=True)
df['LWD'] = pd.to_datetime(df['LWD'], infer_datetime_format=True)
Asked By: Ayan


Answers:

The first thing to note about your example is that you need to pass the dayfirst=True argument to the pd.to_datetime calls that convert the date columns, as shown below:


df['START_DATE'] = pd.to_datetime(df['START_DATE'], infer_datetime_format=True, dayfirst=True)
df['END_DATE'] = pd.to_datetime(df['END_DATE'], infer_datetime_format=True, dayfirst=True)
df['LWD'] = pd.to_datetime(df['LWD'], infer_datetime_format=True, dayfirst=True)
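
As a quick check of why dayfirst matters, the sketch below parses one ambiguous value from the sample data both ways. The explicit format string '%d/%m/%y' is an assumption about the data layout (the question says dd/mm/yyyy, while the sample uses two-digit years); it is one possible alternative to infer_datetime_format, which newer pandas releases deprecate.

```python
import pandas as pd

raw = pd.Series(['01/10/21', '12/11/21', '30/08/21'])

# Default parsing reads an ambiguous string as month/day/year
month_first = pd.to_datetime('01/10/21')
# dayfirst=True reads the same string as day/month/year
day_first = pd.to_datetime('01/10/21', dayfirst=True)

# An explicit format (assumed here to be dd/mm/yy) removes the ambiguity
parsed = pd.to_datetime(raw, format='%d/%m/%y')

print(month_first.date(), day_first.date())
print(parsed.dt.strftime('%Y-%m-%d').tolist())
```

Without dayfirst, values such as '30/08/21' can only be a day-first date while '01/10/21' is silently read month-first, which is how the columns end up inconsistent.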

Once you have made that change, your date fields should report consistent and correct dates, as illustrated below:

df = pd.DataFrame({'EID':[75161,75162,75162,75162,75162,75166,75166,75166,75169,75170],
                   'START_DATE':['30/08/21','01/10/21','18/10/21','12/11/21','14/06/21','22/04/21','22/07/21','23/08/21','24/08/21','25/10/21'],
                   'END_DATE':['30/08/21','01/10/21','18/10/21','12/11/21','14/06/21','23/04/21','23/07/21','23/08/21','26/08/21','25/10/21'],
                   'LWD':['30/08/21','13/11/21','13/11/21','13/11/21','13/11/21','13/10/21','13/10/21','13/10/21','13/10/21','13/11/21'],
                   'DURATION':[1,1,1,1,1,2,2,1,3,1]
                  })

df['START_DATE'] = pd.to_datetime(df['START_DATE'], infer_datetime_format=True, dayfirst=True)
df['END_DATE'] = pd.to_datetime(df['END_DATE'], infer_datetime_format=True, dayfirst=True)
df['LWD'] = pd.to_datetime(df['LWD'], infer_datetime_format=True, dayfirst=True)  

Note: I altered some of your data to add some complexity to the example by giving a single ID leave dates in more than one period of interest.

My dataframe looks like:

    EID     START_DATE  END_DATE    LWD         DURATION
0   75161   2021-08-30  2021-08-30  2021-08-30  1
1   75162   2021-10-01  2021-10-01  2021-11-13  1
2   75162   2021-10-18  2021-10-18  2021-11-13  1
3   75162   2021-11-12  2021-11-12  2021-11-13  1
4   75162   2021-06-14  2021-06-14  2021-11-13  1
5   75166   2021-04-22  2021-04-23  2021-10-13  2
6   75166   2021-07-22  2021-07-23  2021-10-13  2
7   75166   2021-08-23  2021-08-23  2021-10-13  1
8   75169   2021-08-24  2021-08-26  2021-10-13  3
9   75170   2021-10-25  2021-10-25  2021-11-13  1  

Now the first step is to add a column showing how many calendar weeks before the LWD week each leave started, as follows:

# define function to calculate the difference in calendar weeks between two date columns
def week_diff(x: pd.Series, y: pd.Series) -> pd.Series:
    # to_period('W') maps each date to its calendar week; astype('int64') gives the week ordinal
    end = x.dt.to_period('W').astype('int64')
    start = y.dt.to_period('W').astype('int64')
    return end - start

df['wks_delta'] = week_diff(df['LWD'], df['START_DATE'])

Results in:

     EID    START_DATE  END_DATE    LWD         DURATION    wks_delta
0   75161   2021-08-30  2021-08-30  2021-08-30  1           0
1   75162   2021-10-01  2021-10-01  2021-11-13  1           6
2   75162   2021-10-18  2021-10-18  2021-11-13  1           3
3   75162   2021-11-12  2021-11-12  2021-11-13  1           0
4   75162   2021-06-14  2021-06-14  2021-11-13  1           21
5   75166   2021-04-22  2021-04-23  2021-10-13  2           25
6   75166   2021-07-22  2021-07-23  2021-10-13  2           12
7   75166   2021-08-23  2021-08-23  2021-10-13  1           7
8   75169   2021-08-24  2021-08-26  2021-10-13  3           7
9   75170   2021-10-25  2021-10-25  2021-11-13  1           2  
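
Note that to_period('W') counts calendar weeks (by default ending on Sunday), not 7-day spans measured back from LWD as the question describes. If the rolling definition is what is wanted, a minimal sketch is to integer-divide the day difference by 7. The helper name weeks_before_lwd and the two-row sample frame are hypothetical, introduced only for illustration.

```python
import pandas as pd

def weeks_before_lwd(lwd: pd.Series, start: pd.Series) -> pd.Series:
    """0 for leave starting 0-6 days before LWD, 1 for 7-13 days, and so on."""
    return (lwd - start).dt.days // 7

sample = pd.DataFrame({
    'LWD': pd.to_datetime(['2021-11-13', '2021-11-13']),         # a Saturday
    'START_DATE': pd.to_datetime(['2021-11-07', '2021-10-18']),  # a Sunday, a Monday
})
sample['wks_rolling'] = weeks_before_lwd(sample['LWD'], sample['START_DATE'])
# Calendar-week difference, for comparison with the rolling one
sample['wks_calendar'] = (sample['LWD'].dt.to_period('W').astype('int64')
                          - sample['START_DATE'].dt.to_period('W').astype('int64'))
print(sample)
```

A leave on Sunday 7 November is only 6 days before a Saturday 13 November LWD, yet it falls in the previous calendar week, so the two definitions disagree for that row. For the sample data in this answer both approaches happen to give the same values.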

We can then filter this dataframe and group by 'EID' and 'wks_delta' using the following:

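# keep only leave within the four weeks up to LWD (wks_delta 0 to 3, one per output column)
df = df[df['wks_delta'] < 4]
# sum only DURATION; the datetime columns cannot be summed
df1 = df.groupby(['EID', 'wks_delta'])['DURATION'].sum().reset_index()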

resulting in:

    EID    wks_delta    DURATION
0   75161   0            1
1   75162   0            1
2   75162   3            1
3   75170   2            1  

Then, by applying the following:

def computeLeavePeriods(prds: list, df: pd.DataFrame) -> pd.DataFrame:
    row_index = list(df["EID"].unique())
    rows = len(row_index)
    cols = len(prds)
    rslt = [[0]*cols for i in range(rows)]
    for r in range(df.shape[0]):
        rslt[row_index.index(df.iloc[r]['EID'])][df.iloc[r]['wks_delta']] += df.iloc[r]['DURATION']
    return pd.DataFrame(data= rslt, columns=prds, index=row_index)  

computeLeavePeriods(['1-LWD', '2-LWD', '3-LWD', '4-LWD'], df1)  

we get the final result:

      1-LWD 2-LWD   3-LWD   4-LWD
75161   1    0       0       0
75162   1    0       0       1
75170   0    0       1       0  

If DURATION holds float values, df.iloc[r] returns a float row, so wks_delta must be cast back to int before it can be used as a list index. You can modify the computeLeavePeriods function as shown below:

def computeLeavePeriods(prds: list, df: pd.DataFrame) -> pd.DataFrame:
    row_index = list(df["EID"].unique())
    rows = len(row_index)
    cols = len(prds)
    rslt = [[0]*cols for i in range(rows)]
    for r in range(df.shape[0]):
        rslt[row_index.index(df.iloc[r]['EID'])][int(df.iloc[r]['wks_delta'])] += df.iloc[r]['DURATION']
    return pd.DataFrame(data= rslt, columns=prds, index=row_index) 
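
As an alternative to the explicit loop, the same wide table can probably be built with pandas reshaping. This is only a sketch and not part of the original answer: it assumes df1 has the EID, wks_delta and DURATION columns shown above, and it reuses the column labels from the call above.

```python
import pandas as pd

# df1 as produced by the groupby step above
df1 = pd.DataFrame({'EID': [75161, 75162, 75162, 75170],
                    'wks_delta': [0, 0, 3, 2],
                    'DURATION': [1, 1, 1, 1]})
labels = ['1-LWD', '2-LWD', '3-LWD', '4-LWD']

wide = (df1.set_index(['EID', 'wks_delta'])['DURATION']
           .unstack(fill_value=0)                               # one column per week offset
           .reindex(columns=range(len(labels)), fill_value=0))  # add weeks with no leave
wide.columns = labels
print(wide)
```

Because no list indexing is involved, this version does not need the int cast when DURATION is float.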
Answered By: itprorh66