How to create a multi-index pivot table that sums the max values within a sub-group
Question:
I have a somewhat large dataframe of customers assigned to a hub and each hub is in a specific location. The hubs get flagged whenever there’s an issue and I’d like to know the number of customers affected each time this happens.
So I’d like to find the max number of customers assigned to each hub (this would then exclude the times the hub may have been flagged multiple times) and then group the rows by location and the columns by type, then show the sum of the max count of customers over a period of months.
The data looks like:
Hub
Location
DateTime
Month
Type
Customers
J01
NY
01/01/2022
January
Type 1
250
J03
CA
01/21/2022
January
Type 2
111
J01
NY
04/01/2022
April
Type 1
250
J05
CA
06/01/2022
June
Type 1
14
J03
CA
08/18/2022
August
Type 2
111
I did the following code to generate a pivot table and it generates the max values for each hub, but there are hundreds of hubs.
` pd.pivot_table (out,values='Customers',index=['Location','Hub'], columns=
['Type','Month'],aggfunc='max') `
Results mostly look like:
Type
Type 1
Type 2
Month
January
February
March
January
Location
Hub
NA
NY
J01
0
250
250
NA
J04
222
222
222
NA
CA
J03
NA
NA
NA
111
CA
J05
14
14
0
NA
I would like the results to look like:
Type
Type 1
Type 2
Month
January
February
March
January
Location
NY
222
472
472
0
CA
14
14
0
111
Is there an easier way to achieve this?
Answers:
You’re on the right start! pivot_table
is the right way to group the table with columns by type. You also identified that you can perform the max
aggregation at pivot-time:
df = pd.pivot_table(
out,
values='Customers',
index=['Location','Hub'],
columns=['Type','Month'],
aggfunc='max'
)
Next, we need to groupby Location
and add (sum) the values:
df = df.groupby("Location").max()
The final result (trying to build a table that matches your data):
out = pd.read_csv(io.StringIO("""
Hub Location DateTime Month Type Customers
J01 NY 01/01/2022 January Type 1 250
J03 CA 01/21/2022 January Type 2 111
J01 NY 04/01/2022 April Type 1 250
J04 NY 01/01/2022 January Type 1 222
J04 NY 02/01/2022 February Type 1 222
J04 NY 04/01/2022 April Type 1 222
J05 CA 06/01/2022 June Type 1 14
J03 CA 08/18/2022 August Type 2 111
"""), sep='t')
pd.pivot_table(
out,
values='Customers',
index=['Location','Hub'],
columns=['Type','Month'],
aggfunc='max'
).groupby("Location").sum()
gives:
Type Type 1 Type 2
Month April February January June August January
Location
CA 0.0 0.0 0.0 14.0 111.0 111.0
NY 472.0 222.0 472.0 0.0 0.0 0.0
I have a somewhat large dataframe of customers assigned to a hub and each hub is in a specific location. The hubs get flagged whenever there’s an issue and I’d like to know the number of customers affected each time this happens.
So I’d like to find the max number of customers assigned to each hub (this would then exclude the times the hub may have been flagged multiple times) and then group the rows by location and the columns by type, then show the sum of the max count of customers over a period of months.
The data looks like:
Hub | Location | DateTime | Month | Type | Customers |
---|---|---|---|---|---|
J01 | NY | 01/01/2022 | January | Type 1 | 250 |
J03 | CA | 01/21/2022 | January | Type 2 | 111 |
J01 | NY | 04/01/2022 | April | Type 1 | 250 |
J05 | CA | 06/01/2022 | June | Type 1 | 14 |
J03 | CA | 08/18/2022 | August | Type 2 | 111 |
I did the following code to generate a pivot table and it generates the max values for each hub, but there are hundreds of hubs.
` pd.pivot_table (out,values='Customers',index=['Location','Hub'], columns=
['Type','Month'],aggfunc='max') `
Results mostly look like:
Type | Type 1 | Type 2 | |||
---|---|---|---|---|---|
Month | January | February | March | January | |
Location | Hub | NA | |||
NY | J01 | 0 | 250 | 250 | NA |
J04 | 222 | 222 | 222 | NA | |
CA | J03 | NA | NA | NA | 111 |
CA | J05 | 14 | 14 | 0 | NA |
I would like the results to look like:
Type | Type 1 | Type 2 | |||
---|---|---|---|---|---|
Month | January | February | March | January | |
Location | |||||
NY | 222 | 472 | 472 | 0 | |
CA | 14 | 14 | 0 | 111 |
Is there an easier way to achieve this?
You’re on the right start! pivot_table
is the right way to group the table with columns by type. You also identified that you can perform the max
aggregation at pivot-time:
df = pd.pivot_table(
out,
values='Customers',
index=['Location','Hub'],
columns=['Type','Month'],
aggfunc='max'
)
Next, we need to groupby Location
and add (sum) the values:
df = df.groupby("Location").max()
The final result (trying to build a table that matches your data):
out = pd.read_csv(io.StringIO("""
Hub Location DateTime Month Type Customers
J01 NY 01/01/2022 January Type 1 250
J03 CA 01/21/2022 January Type 2 111
J01 NY 04/01/2022 April Type 1 250
J04 NY 01/01/2022 January Type 1 222
J04 NY 02/01/2022 February Type 1 222
J04 NY 04/01/2022 April Type 1 222
J05 CA 06/01/2022 June Type 1 14
J03 CA 08/18/2022 August Type 2 111
"""), sep='t')
pd.pivot_table(
out,
values='Customers',
index=['Location','Hub'],
columns=['Type','Month'],
aggfunc='max'
).groupby("Location").sum()
gives:
Type Type 1 Type 2
Month April February January June August January
Location
CA 0.0 0.0 0.0 14.0 111.0 111.0
NY 472.0 222.0 472.0 0.0 0.0 0.0