pandas group data at 3 month intervals and aggregate list of functions

Question:

I have a dataframe as shown below

df = pd.DataFrame({'subject_id':[1,1,1,1,1,1,1,2,2,2,2,2],
                   'invoice_id':[1,2,3,4,5,6,7,8,9,10,11,12],
                   'purchase_date' :['2017-04-03 12:35:00','2017-04-03 12:50:00','2018-04-05 12:59:00','2018-05-04 13:14:00','2017-05-05 13:37:00','2018-07-06 13:39:00','2018-07-08 11:30:00','2017-04-08 16:00:00','2019-04-09 22:00:00','2019-04-11 04:00:00','2018-04-13 04:30:00','2017-04-14 08:00:00'],
                   'val' :[5,5,5,5,1,6,5,5,8,3,4,6],
                   'Prod_id':['A1','A1','C1','A1','E1','Q1','G1','F1','G1','H1','J1','A1']})
df['purchase_date'] = pd.to_datetime(df['purchase_date'])

I would like to do the following:

a) group the data by subject_id at 3-month intervals (using the purchase_date column)

b) compute statistics such as mean, sum, nunique and count for each group, based on the other variables such as Prod_id, val, etc.

For example: the earliest purchase date in df is 2017-04-03, so the starting month in the dataset is April. Counting 3 months from April, APR, MAY and JUN form M1, JUL, AUG and SEP form M2, and so on. This creates data at 3-month intervals.
Whenever a 3-month interval contains no data, its values should be filled with zero (0).
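The bucketing rule above can be sketched with plain month arithmetic (a minimal, hypothetical sample rather than the full dataframe):

```python
import pandas as pd

# hypothetical sample of purchase dates
dates = pd.Series(pd.to_datetime(
    ['2017-04-03', '2017-05-05', '2017-07-06', '2018-04-05']))

# absolute month number, so consecutive calendar months differ by exactly 1
months = dates.dt.year * 12 + dates.dt.month

# 3-month buckets counted from the earliest month in the data
bucket = (months - months.min()) // 3 + 1
print(bucket.tolist())  # [1, 1, 2, 5]
```

April and May 2017 land in bucket 1, July 2017 in bucket 2, and April 2018 (12 months later) in bucket 5, matching the M1/M2/... scheme described above.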

So, I tried something like below based on online research

    length_of_purchase_date = 10

    date_month_map = {
        str(x)[:length_of_purchase_date]: 'M_%s' % (i + 1) for i, x in enumerate(
            sorted(df.reset_index()['purchase_date'].unique(), reverse=True)
        )
    }
    df.reset_index().groupby([
        'subject_id',
        pd.Grouper(key='purchase_date', freq='3M')
    ]).agg({
        'val': ['sum', 'mean', 'count'],
    })

I expect my output to look like the below (shown for subject_id = 1). Please note that I have to do this on big data with millions of rows.

[expected output shown as an image in the original post]

Asked By: The Great


Answers:

Use:

#sort by subject_id and purchase_date
df = df.sort_values(['subject_id','purchase_date'])

#create 3-month group numbers: convert to monthly periods and subtract the minimum
per = df['purchase_date'].dt.to_period('m').astype('int')
df['date_group'] = per.sub(per.min()) // 3 + 1
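As a sanity check, applying the same idea to the question's twelve purchase dates gives the group numbers below (using year/month arithmetic, which is equivalent to the period-to-integer conversion):

```python
import pandas as pd

# the question's purchase dates (times don't matter for monthly periods)
dates = pd.to_datetime(pd.Series(
    ['2017-04-03', '2017-04-03', '2018-04-05', '2018-05-04',
     '2017-05-05', '2018-07-06', '2018-07-08', '2017-04-08',
     '2019-04-09', '2019-04-11', '2018-04-13', '2017-04-14']))

per = dates.dt.year * 12 + dates.dt.month
print((per.sub(per.min()) // 3 + 1).tolist())
# [1, 1, 5, 5, 1, 6, 6, 1, 9, 9, 5, 1]
```

These group numbers match the M1/M5/M6 (subject 1) and M1/M5/M9 (subject 2) rows in the final output.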

import numpy as np

#custom function: return the first value that occurs more than once, else NaN
def f(x):
    y = x.value_counts()
    return next(iter(y.index[y.gt(1)]), np.nan)
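For example, f picks out a repeated product id when one exists and falls back to NaN otherwise (small hypothetical inputs):

```python
import numpy as np
import pandas as pd

def f(x):
    y = x.value_counts()
    return next(iter(y.index[y.gt(1)]), np.nan)

print(f(pd.Series(['A1', 'A1', 'C1'])))  # A1 (occurs more than once)
print(f(pd.Series(['F1', 'A1'])))        # nan (nothing repeats)
```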

#aggregate per subject_id and date_group
df = df.groupby(['subject_id', 'date_group']).agg(max_date=('purchase_date','max'),
                                                  nunique=('Prod_id','nunique'),
                                                  count_prod_id=('Prod_id','count'),
                                                  sum_val=('val','sum'),
                                                  avg_val=('val','mean'),
                                                  min_val=('val','min'),
                                                  max_val=('val','max'),
                                                  Top1st_prod_id=('Prod_id',f))

#dictionary to fill the numeric columns with 0
d = dict.fromkeys(df.columns.difference(['max_date','Top1st_prod_id']), 0)
#add missing date_group rows with reindex
df = (df.reset_index(level=0)
         .groupby('subject_id')
         .apply(lambda x: x.reindex(range(1, x.index.max() + 1)))
         .fillna(d)) 
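The reindex step is what fills the gaps in the group numbers; on a single subject it reduces to the following (hypothetical values):

```python
import pandas as pd

# hypothetical: a subject has data only in date groups 1 and 5
s = pd.Series({1: 11, 5: 10}, name='sum_val')

# extend the index to the full 1..max range and fill the holes with 0
full = s.reindex(range(1, s.index.max() + 1), fill_value=0)
print(full.tolist())  # [11, 0, 0, 0, 10]
```

This is how the empty M2-M4 rows appear in the final output even though no purchases fall in those intervals.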

#custom formatting of columns
df['max_date'] = df['max_date'].dt.strftime('%d-%b-%y')
df = df.drop('subject_id', axis=1).reset_index()
df['date_group'] = 'M' + df['date_group'].astype(str)

print (df)
    subject_id date_group   max_date  nunique  count_prod_id  sum_val  
0            1         M1  05-May-17      2.0            3.0     11.0   
1            1         M2        NaN      0.0            0.0      0.0   
2            1         M3        NaN      0.0            0.0      0.0   
3            1         M4        NaN      0.0            0.0      0.0   
4            1         M5  04-May-18      2.0            2.0     10.0   
5            1         M6  08-Jul-18      2.0            2.0     11.0   
6            2         M1  14-Apr-17      2.0            2.0     11.0   
7            2         M2        NaN      0.0            0.0      0.0   
8            2         M3        NaN      0.0            0.0      0.0   
9            2         M4        NaN      0.0            0.0      0.0   
10           2         M5  13-Apr-18      1.0            1.0      4.0   
11           2         M6        NaN      0.0            0.0      0.0   
12           2         M7        NaN      0.0            0.0      0.0   
13           2         M8        NaN      0.0            0.0      0.0   
14           2         M9  11-Apr-19      2.0            2.0     11.0   

     avg_val  min_val  max_val Top1st_prod_id  
0   3.666667      1.0      5.0             A1  
1   0.000000      0.0      0.0            NaN  
2   0.000000      0.0      0.0            NaN  
3   0.000000      0.0      0.0            NaN  
4   5.000000      5.0      5.0            NaN  
5   5.500000      5.0      6.0            NaN  
6   5.500000      5.0      6.0            NaN  
7   0.000000      0.0      0.0            NaN  
8   0.000000      0.0      0.0            NaN  
9   0.000000      0.0      0.0            NaN  
10  4.000000      4.0      4.0            NaN  
11  0.000000      0.0      0.0            NaN  
12  0.000000      0.0      0.0            NaN  
13  0.000000      0.0      0.0            NaN  
14  5.500000      3.0      8.0            NaN  
Answered By: jezrael