python pandas – group by two columns and find average

Question:

I have a dataframe like this:

TxnId     TxnDate           TxnCount
  233     2023-02-01      2
  533     2023-02-01      1
  433     2023-02-01      4
  233     2023-02-02      3
  533     2023-02-02      5
  233     2023-02-03      3
  533     2023-02-03      5
  433     2023-02-03      2

I want to compute the average TxnCount for each TxnId over at most the last 3 days before today and store it in a separate column.

Let's say today = 2023-02-04. I would need the average TxnCount for each TxnId going back to 2023-02-01. My expected result would be:

TxnId     TxnDate           TxnCount     AVG
  233     2023-02-01      2            2
  533     2023-02-01      1            1
  433     2023-02-01      4            4  
  233     2023-02-02      3            2.5  [(3+2)/2]  
  533     2023-02-02      5            3    [(5+1)/2]   
  233     2023-02-03      3            2.66 [(3+3+2)/3]           
  533     2023-02-03      5            3.66 [(5+5+1)/3]  
  433     2023-02-03      2            3    [(2 + 4)/2] Only for two days TxnId is present

Could you please help me achieve this in Python?

Asked By: KurinchiMalar


Answers:

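First, replace TxnCount with NaN for rows outside the window from 3 days before today through today (in the sample data every row falls inside it). Then apply Series.rolling per TxnId group and remove the extra group level of the resulting MultiIndex with Series.droplevel: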

import pandas as pd

df = df.reset_index(drop=True)

df['TxnDate'] = pd.to_datetime(df['TxnDate'])

today = pd.to_datetime('2023-02-04')

s = df['TxnCount'].where(df['TxnDate'].between(today - pd.Timedelta('3 days'), today))

#solution 1
#df['AVG'] = s.groupby(df['TxnId']).rolling(3, min_periods=1).mean().droplevel(0)
#solution 2
df['AVG'] = s.groupby(df['TxnId']).rolling(3, min_periods=1).mean().reset_index(0,drop=True)
print(df)
   TxnId    TxnDate  TxnCount       AVG
0    233 2023-02-01         2  2.000000
1    533 2023-02-01         1  1.000000
2    433 2023-02-01         4  4.000000
3    233 2023-02-02         3  2.500000
4    533 2023-02-02         5  3.000000
5    233 2023-02-03         3  2.666667
6    533 2023-02-03         5  3.666667
7    433 2023-02-03         2  3.000000
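
Note that rolling(3, ...) counts rows, not calendar days, so it relies on each TxnId having at most one row per day. If dates can repeat or the data covers a longer range, a time-based window may be closer to what is wanted. The following is only a sketch of that variant, not part of the original solution: it assumes the sample data from the question, and the names in_window and rolled are made up for illustration.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "TxnId": [233, 533, 433, 233, 533, 233, 533, 433],
    "TxnDate": pd.to_datetime([
        "2023-02-01", "2023-02-01", "2023-02-01",
        "2023-02-02", "2023-02-02",
        "2023-02-03", "2023-02-03", "2023-02-03",
    ]),
    "TxnCount": [2, 1, 4, 3, 5, 3, 5, 2],
})

today = pd.Timestamp("2023-02-04")

# Keep only rows from the 3 days before today, then order each TxnId by date
# (time-based windows need dates in order within each group).
in_window = df["TxnDate"].between(today - pd.Timedelta(days=3), today)
df = (
    df[in_window]
    .sort_values(["TxnId", "TxnDate"])
    .reset_index(drop=True)
)

# "3D" is a time-based window: the current date plus the two calendar days before it.
rolled = (
    df.set_index("TxnDate")
      .groupby("TxnId")["TxnCount"]
      .rolling("3D", min_periods=1)
      .mean()
)

# Groups come back in TxnId order, matching the sort above, so positions line up.
df["AVG"] = rolled.to_numpy()
print(df)
```

The rows end up ordered by TxnId and then TxnDate rather than in the original order. On the sample data the averages are the same as with the row-based window.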
Answered By: jezrael

Make sure your TxnDate column is of datetime type, then define today and the date 3 days before it:

today = pd.Timestamp.now().normalize()
threedays = today - pd.Timedelta(days=3)

Filter the rows to that window:

df = df.loc[(df['TxnDate'] >= threedays) & (df['TxnDate'] <= today)]

Then group by TxnId and merge the averages back:

temp = df.groupby('TxnId', as_index=False).agg(AVG=('TxnCount', 'mean'))
df = df.merge(temp, on=['TxnId'], how='inner')
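
For reference, here is a minimal runnable sketch of this filter-then-group idea on the question's sample data. It is an illustration under assumptions: today is fixed to 2023-02-04 so the result is reproducible, the name recent is hypothetical, and groupby(...).transform('mean') is swapped in for the agg plus merge step. Note that this yields one overall average per TxnId repeated on every row, not the running average shown in the question's expected output.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "TxnId": [233, 533, 433, 233, 533, 233, 533, 433],
    "TxnDate": pd.to_datetime([
        "2023-02-01", "2023-02-01", "2023-02-01",
        "2023-02-02", "2023-02-02",
        "2023-02-03", "2023-02-03", "2023-02-03",
    ]),
    "TxnCount": [2, 1, 4, 3, 5, 3, 5, 2],
})

# Fixed date for reproducibility; pd.Timestamp.now().normalize() would be used in practice.
today = pd.Timestamp("2023-02-04")
start = today - pd.Timedelta(days=3)

# Keep only the rows inside the 3-day window.
recent = df[df["TxnDate"].between(start, today)].copy()

# Broadcast each TxnId's mean over the window back onto its rows.
recent["AVG"] = recent.groupby("TxnId")["TxnCount"].transform("mean")
print(recent)
```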
Answered By: nabilahnran