python pandas – group by two columns and find average
Question:
I have a dataframe like this
TxnId TxnDate TxnCount
233 2023-02-01 2
533 2023-02-01 1
433 2023-02-01 4
233 2023-02-02 3
533 2023-02-02 5
233 2023-02-03 3
533 2023-02-03 5
433 2023-02-03 2
I want to compute the average of TxnCount for every TxnId over at most the last 3 days from today, and have it in a separate column.
Let's say today = 2023-02-04. I would then need the average TxnCount for each TxnId back to 2023-02-01. My expected result would be:
TxnId TxnDate TxnCount AVG
233 2023-02-01 2 2
533 2023-02-01 1 1
433 2023-02-01 4 4
233 2023-02-02 3 2.5 [(3+2)/2]
533 2023-02-02 5 3 [(5+1)/2]
233 2023-02-03 3 2.66 [(3+3+2)/3]
533 2023-02-03 5 3.66 [(5+5+1)/3]
433 2023-02-03 2 3 [(2+4)/2] (this TxnId is present on only two days)
Could you please help me achieve this in Python?
Answers:
First mask the TxnCount values that fall outside the window of today and the 3 previous days with Series.where (in the sample data all rows match), then compute a rolling mean per group with Series.rolling and remove the group level of the resulting MultiIndex with Series.droplevel:
import pandas as pd

df = df.reset_index(drop=True)
df['TxnDate'] = pd.to_datetime(df['TxnDate'])
today = pd.to_datetime('2023-02-04')
# mask counts outside [today - 3 days, today]
s = df['TxnCount'].where(df['TxnDate'].between(today - pd.Timedelta('3 days'), today))
# solution 1
# df['AVG'] = s.groupby(df['TxnId']).rolling(3, min_periods=1).mean().droplevel(0)
# solution 2
df['AVG'] = s.groupby(df['TxnId']).rolling(3, min_periods=1).mean().reset_index(0, drop=True)
print(df)
TxnId TxnDate TxnCount AVG
0 233 2023-02-01 2 2.000000
1 533 2023-02-01 1 1.000000
2 433 2023-02-01 4 4.000000
3 233 2023-02-02 3 2.500000
4 533 2023-02-02 5 3.000000
5 233 2023-02-03 3 2.666667
6 533 2023-02-03 5 3.666667
7 433 2023-02-03 2 3.000000
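The approach above can be verified end to end with the sample data from the question. A self-contained sketch (`today` is hard-coded to 2023-02-04 as in the answer, so the result is deterministic):

```python
import pandas as pd

# Build the sample frame from the question
df = pd.DataFrame({
    'TxnId':   [233, 533, 433, 233, 533, 233, 533, 433],
    'TxnDate': ['2023-02-01', '2023-02-01', '2023-02-01', '2023-02-02',
                '2023-02-02', '2023-02-03', '2023-02-03', '2023-02-03'],
    'TxnCount': [2, 1, 4, 3, 5, 3, 5, 2],
})
df['TxnDate'] = pd.to_datetime(df['TxnDate'])

today = pd.to_datetime('2023-02-04')

# Mask counts outside the [today - 3 days, today] window
s = df['TxnCount'].where(df['TxnDate'].between(today - pd.Timedelta('3 days'), today))

# Rolling mean of up to 3 rows per TxnId; droplevel(0) removes the
# TxnId level so the result aligns with the original index
df['AVG'] = s.groupby(df['TxnId']).rolling(3, min_periods=1).mean().droplevel(0)

print(df)
```

The per-group rolling window walks the rows in their original order, so for TxnId 233 the last row averages (3+3+2)/3, matching the expected output.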
Make sure your TxnDate is datetime type, then define the date 3 days before today (note `pd.datetime` has been removed in modern pandas; use pd.Timestamp instead):
threedays = pd.Timestamp.now().normalize() - pd.Timedelta(days=3)
filter to the last three days:
df = df.loc[(df['TxnDate'] >= threedays) & (df['TxnDate'] <= pd.Timestamp.now())]
then groupby and merge the per-TxnId mean back:
temp = df.groupby('TxnId', as_index=False).agg(AVG=('TxnCount', 'mean'))
df.merge(temp, on=['TxnId'], how='inner')
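For reference, a runnable sketch of this filter-groupby-merge approach on the question's sample data. Note that, unlike the rolling solution, it produces a single overall mean per TxnId rather than a running average; "today" is hard-coded to 2023-02-04 here so the example is reproducible:

```python
import pandas as pd

df = pd.DataFrame({
    'TxnId':   [233, 533, 433, 233, 533, 233, 533, 433],
    'TxnDate': pd.to_datetime(['2023-02-01', '2023-02-01', '2023-02-01',
                               '2023-02-02', '2023-02-02', '2023-02-03',
                               '2023-02-03', '2023-02-03']),
    'TxnCount': [2, 1, 4, 3, 5, 3, 5, 2],
})

# Fix "today" so the window is deterministic
today = pd.Timestamp('2023-02-04')
threedays = today - pd.Timedelta(days=3)

# Keep only rows from the last three days
recent = df.loc[(df['TxnDate'] >= threedays) & (df['TxnDate'] <= today)]

# One overall mean per TxnId, merged back onto every row
temp = recent.groupby('TxnId', as_index=False).agg(AVG=('TxnCount', 'mean'))
out = df.merge(temp, on=['TxnId'], how='inner')
print(out)
```

Here TxnId 233 gets (2+3+3)/3 on every one of its rows, whereas the expected output asked for the average as of each row's date.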