Iterating on a group of columns in a dataframe from a custom list – pandas
Question:
I have a dataframe df like this
TxnId TxnDate TxnCount
100 2023-02-01 2
500 2023-02-01 1
400 2023-02-01 4
100 2023-02-02 3
500 2023-02-02 5
100 2023-02-03 3
500 2023-02-03 5
400 2023-02-03 2
I have the following custom lists
datelist = [datetime.date(2023, 2, 3), datetime.date(2023, 2, 2)]
txnlist = [400,500]
I want to iterate over the df with the logic below:
for every txn in txnlist:
    sum = 0
    for every date in datelist:
        sum += df[txn][date].TxnCount
I would also be interested to understand how to find average of TxnCount for filtered TxnIds.
After the sum step, based on the above input and filters:
TxnId TxnCount
400 2
500 10
Average corresponding to TxnId 400 = (2+0)/2 = 1
Average corresponding to TxnId 500 = (5+5)/2 = 5
If the average > 3, add that row to breachList
breachList =[[500,10]]
Please help me do this in pandas.
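For reference, the nested-loop logic above can be written out as plain Python over the sample rows (a pandas-free sketch of the intended algorithm; the row tuples below just transcribe the sample dataframe):

```python
import datetime

# Sample rows from the question: (TxnId, TxnDate, TxnCount)
rows = [
    (100, datetime.date(2023, 2, 1), 2),
    (500, datetime.date(2023, 2, 1), 1),
    (400, datetime.date(2023, 2, 1), 4),
    (100, datetime.date(2023, 2, 2), 3),
    (500, datetime.date(2023, 2, 2), 5),
    (100, datetime.date(2023, 2, 3), 3),
    (500, datetime.date(2023, 2, 3), 5),
    (400, datetime.date(2023, 2, 3), 2),
]

datelist = [datetime.date(2023, 2, 3), datetime.date(2023, 2, 2)]
txnlist = [400, 500]

breachList = []
for txn in txnlist:
    total = 0
    for date in datelist:
        # A missing (txn, date) combination contributes 0
        total += sum(c for i, d, c in rows if i == txn and d == date)
    average = total / len(datelist)
    if average > 3:
        breachList.append([txn, total])

print(breachList)  # [[500, 10]]
```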
Answers:
First filter the DataFrame by both lists with boolean indexing and Series.isin:
df1 = df[df['TxnId'].isin(txnlist) & pd.to_datetime(df['TxnDate']).dt.date.isin(datelist)]
print (df1)
TxnId TxnDate TxnCount
4 500 2023-02-02 5
6 500 2023-02-03 5
7 400 2023-02-03 2
Then aggregate the sum of column TxnCount per group:
out = df1.groupby('TxnId', as_index=False)['TxnCount'].sum()
print (out)
TxnId TxnCount
0 400 2
1 500 10
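Put together, the filter-then-sum step can be sketched as a self-contained snippet (the DataFrame literal below just reproduces the sample data from the question):

```python
import datetime
import pandas as pd

# Sample data transcribed from the question
df = pd.DataFrame({
    'TxnId':    [100, 500, 400, 100, 500, 100, 500, 400],
    'TxnDate':  ['2023-02-01', '2023-02-01', '2023-02-01', '2023-02-02',
                 '2023-02-02', '2023-02-03', '2023-02-03', '2023-02-03'],
    'TxnCount': [2, 1, 4, 3, 5, 3, 5, 2],
})
datelist = [datetime.date(2023, 2, 3), datetime.date(2023, 2, 2)]
txnlist = [400, 500]

# Keep only rows whose TxnId and TxnDate appear in the custom lists
mask = (df['TxnId'].isin(txnlist)
        & pd.to_datetime(df['TxnDate']).dt.date.isin(datelist))
out = df[mask].groupby('TxnId', as_index=False)['TxnCount'].sum()
print(out)
```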
EDIT: If you need to filter TxnId by average, here greater than 4, use:
df1 = df[df['TxnId'].isin(txnlist) & pd.to_datetime(df['TxnDate']).dt.date.isin(datelist)]
print (df1)
TxnId TxnDate TxnCount
4 500 2023-02-02 5
6 500 2023-02-03 5
7 400 2023-02-03 2
# compute averages per TxnId
out = df1.groupby('TxnId')['TxnCount'].mean()
print (out)
TxnId
400 2
500 5
Name: TxnCount, dtype: int64
# get TxnIds with average greater than 4
TxnId = out[out > 4].index
print (TxnId)
Int64Index([500], dtype='int64', name='TxnId')
Filter rows in df or df1:
df2 = df[df['TxnId'].isin(TxnId)]
print(df2)
TxnId TxnDate TxnCount
1 500 2023-02-01 1
4 500 2023-02-02 5
6 500 2023-02-03 5
df3 = df1[df1['TxnId'].isin(TxnId)]
print(df3)
TxnId TxnDate TxnCount
4 500 2023-02-02 5
6 500 2023-02-03 5
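End to end, the average-based filter might look like this (the sample data is rebuilt below; `keep` is a hypothetical name for the filtered index):

```python
import datetime
import pandas as pd

# Sample data transcribed from the question
df = pd.DataFrame({
    'TxnId':    [100, 500, 400, 100, 500, 100, 500, 400],
    'TxnDate':  ['2023-02-01', '2023-02-01', '2023-02-01', '2023-02-02',
                 '2023-02-02', '2023-02-03', '2023-02-03', '2023-02-03'],
    'TxnCount': [2, 1, 4, 3, 5, 3, 5, 2],
})
datelist = [datetime.date(2023, 2, 3), datetime.date(2023, 2, 2)]
txnlist = [400, 500]

df1 = df[df['TxnId'].isin(txnlist)
         & pd.to_datetime(df['TxnDate']).dt.date.isin(datelist)]

# Average TxnCount per TxnId, then keep the ids above the threshold
means = df1.groupby('TxnId')['TxnCount'].mean()
keep = means[means > 4].index
df3 = df1[df1['TxnId'].isin(keep)]
print(df3)
```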
EDIT1: For the expected output, first filter by the lists (to avoid processing all rows):
df1 = df[df['TxnId'].isin(txnlist) & pd.to_datetime(df['TxnDate']).dt.date.isin(datelist)]
print (df1)
TxnId TxnDate TxnCount
4 500 2023-02-02 5
6 500 2023-02-03 5
7 400 2023-02-03 2
Pivot for all TxnDate/TxnId combinations:
out = df1.pivot_table(index='TxnId',
columns='TxnDate',
values='TxnCount',
aggfunc='sum',
fill_value=0)
print (out)
TxnDate 2023-02-02 2023-02-03
TxnId
400 0 2
500 5 5
Finally, filter the row sums by the row means and convert to a list of lists:
breachList = out.sum(axis=1)[out.mean(axis=1).gt(3)].reset_index().to_numpy().tolist()
print (breachList)
[[500, 10]]
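The whole pivot-and-filter pipeline can be sketched as one runnable snippet (sample data reproduced from the question):

```python
import datetime
import pandas as pd

# Sample data transcribed from the question
df = pd.DataFrame({
    'TxnId':    [100, 500, 400, 100, 500, 100, 500, 400],
    'TxnDate':  ['2023-02-01', '2023-02-01', '2023-02-01', '2023-02-02',
                 '2023-02-02', '2023-02-03', '2023-02-03', '2023-02-03'],
    'TxnCount': [2, 1, 4, 3, 5, 3, 5, 2],
})
datelist = [datetime.date(2023, 2, 3), datetime.date(2023, 2, 2)]
txnlist = [400, 500]

df1 = df[df['TxnId'].isin(txnlist)
         & pd.to_datetime(df['TxnDate']).dt.date.isin(datelist)]

# Pivot so every TxnId/TxnDate combination gets a cell (missing ones -> 0)
table = df1.pivot_table(index='TxnId', columns='TxnDate',
                        values='TxnCount', aggfunc='sum', fill_value=0)

# Row sums where the row mean exceeds 3, as a list of [TxnId, sum] pairs
breachList = (table.sum(axis=1)[table.mean(axis=1).gt(3)]
              .reset_index().to_numpy().tolist())
print(breachList)  # [[500, 10]]
```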
The fact that you are using a nested loop is reminiscent of a 2D pivot_table (or crosstab):
df['TxnDate'] = pd.to_datetime(df['TxnDate'])
out = (df.pivot_table(index='TxnId', columns='TxnDate',
                      values='TxnCount', aggfunc='sum',
                      fill_value=0)
         .reindex(index=txnlist, columns=pd.to_datetime(datelist))
      )
Output:
TxnDate 2023-02-03 2023-02-02
TxnId
400 2 0
500 5 5
And if you want to further aggregate on Ids (or Date):
out.sum(axis=1)
TxnId
400 2
500 10
dtype: int64
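A self-contained sketch of this pivot approach, reindexed to the custom lists (note that `reindex` needs keyword arguments here, and `datelist` is converted with `pd.to_datetime` so its labels match the Timestamp columns produced by the pivot):

```python
import datetime
import pandas as pd

# Sample data transcribed from the question
df = pd.DataFrame({
    'TxnId':    [100, 500, 400, 100, 500, 100, 500, 400],
    'TxnDate':  ['2023-02-01', '2023-02-01', '2023-02-01', '2023-02-02',
                 '2023-02-02', '2023-02-03', '2023-02-03', '2023-02-03'],
    'TxnCount': [2, 1, 4, 3, 5, 3, 5, 2],
})
datelist = [datetime.date(2023, 2, 3), datetime.date(2023, 2, 2)]
txnlist = [400, 500]

df['TxnDate'] = pd.to_datetime(df['TxnDate'])
out = (df.pivot_table(index='TxnId', columns='TxnDate',
                      values='TxnCount', aggfunc='sum', fill_value=0)
         # keyword args so the row/column labels go where intended;
         # convert datelist to Timestamps to match the pivoted columns
         .reindex(index=txnlist, columns=pd.to_datetime(datelist)))

# Further aggregate per TxnId
totals = out.sum(axis=1)
print(totals)
```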