Python pandas groupby aggregate on multiple columns, then pivot
Question:
In Python, I have a pandas DataFrame similar to the following:
Item | shop1 | shop2 | shop3 | Category
------------------------------------
Shoes| 45 | 50 | 53 | Clothes
TV | 200 | 300 | 250 | Technology
Book | 20 | 17 | 21 | Books
phone| 300 | 350 | 400 | Technology
Where shop1, shop2 and shop3 are the costs of every item in different shops.
Now, I need to return a DataFrame, after some data cleaning, like this one:
Category (index)| size| sum| mean | std
----------------------------------------
where size is the number of items in each Category and sum, mean and std are related to the same functions applied to the 3 shops. How can I do these operations with the split-apply-combine pattern (groupby, aggregate, apply,…) ?
Can someone help me out? I’m going crazy with this one…thank you!
Answers:
df.groupby('Category').agg({'Item':'size','shop1':['sum','mean','std'],'shop2':['sum','mean','std'],'shop3':['sum','mean','std']})
Or if you want it across all shops then:
df1 = df.set_index(['Item','Category']).stack().reset_index().rename(columns={'level_2':'Shops',0:'costs'})
df1.groupby('Category').agg({'Item':'size','costs':['sum','mean','std']})
If I understand correctly, you want to calculate aggregate metrics for all shops, not for each individually. To do that, you can first stack
your dataframe and then group by Category
:
stacked = df.set_index(['Item', 'Category']).stack().reset_index()
stacked.columns = ['Item', 'Category', 'Shop', 'Price']
stacked.groupby('Category').agg({'Price':['count','sum','mean','std']})
Which results in
Price
count sum mean std
Category
Books 3 58 19.333333 2.081666
Clothes 3 148 49.333333 4.041452
Technology 6 1800 300.000000 70.710678
Edited for Pandas 0.22+ considering the deprecation of the use of dictionaries in a group by aggregation.
We set up a very similar dictionary where we use the keys of the dictionary to specify our functions and the dictionary itself to rename the columns.
rnm_cols = dict(size='Size', sum='Sum', mean='Mean', std='Std')
df.set_index(['Category', 'Item']).stack().groupby('Category')
.agg(rnm_cols.keys()).rename(columns=rnm_cols)
Size Sum Mean Std
Category
Books 3 58 19.333333 2.081666
Clothes 3 148 49.333333 4.041452
Technology 6 1800 300.000000 70.710678
option 1
use agg
← link to docs
agg_funcs = dict(Size='size', Sum='sum', Mean='mean', Std='std')
df.set_index(['Category', 'Item']).stack().groupby(level=0).agg(agg_funcs)
Std Sum Mean Size
Category
Books 2.081666 58 19.333333 3
Clothes 4.041452 148 49.333333 3
Technology 70.710678 1800 300.000000 6
option 2
more for less
use describe
← link to docs
df.set_index(['Category', 'Item']).stack().groupby(level=0).describe().unstack()
count mean std min 25% 50% 75% max
Category
Books 3.0 19.333333 2.081666 17.0 18.5 20.0 20.5 21.0
Clothes 3.0 49.333333 4.041452 45.0 47.5 50.0 51.5 53.0
Technology 6.0 300.000000 70.710678 200.0 262.5 300.0 337.5 400.0
In Python, I have a pandas DataFrame similar to the following:
Item | shop1 | shop2 | shop3 | Category
------------------------------------
Shoes| 45 | 50 | 53 | Clothes
TV | 200 | 300 | 250 | Technology
Book | 20 | 17 | 21 | Books
phone| 300 | 350 | 400 | Technology
Where shop1, shop2 and shop3 are the costs of every item in different shops.
Now, I need to return a DataFrame, after some data cleaning, like this one:
Category (index)| size| sum| mean | std
----------------------------------------
where size is the number of items in each Category and sum, mean and std are related to the same functions applied to the 3 shops. How can I do these operations with the split-apply-combine pattern (groupby, aggregate, apply,…) ?
Can someone help me out? I’m going crazy with this one…thank you!
df.groupby('Category').agg({'Item':'size','shop1':['sum','mean','std'],'shop2':['sum','mean','std'],'shop3':['sum','mean','std']})
Or if you want it across all shops then:
df1 = df.set_index(['Item','Category']).stack().reset_index().rename(columns={'level_2':'Shops',0:'costs'})
df1.groupby('Category').agg({'Item':'size','costs':['sum','mean','std']})
If I understand correctly, you want to calculate aggregate metrics for all shops, not for each individually. To do that, you can first stack
your dataframe and then group by Category
:
stacked = df.set_index(['Item', 'Category']).stack().reset_index()
stacked.columns = ['Item', 'Category', 'Shop', 'Price']
stacked.groupby('Category').agg({'Price':['count','sum','mean','std']})
Which results in
Price
count sum mean std
Category
Books 3 58 19.333333 2.081666
Clothes 3 148 49.333333 4.041452
Technology 6 1800 300.000000 70.710678
Edited for Pandas 0.22+ considering the deprecation of the use of dictionaries in a group by aggregation.
We set up a very similar dictionary where we use the keys of the dictionary to specify our functions and the dictionary itself to rename the columns.
rnm_cols = dict(size='Size', sum='Sum', mean='Mean', std='Std')
df.set_index(['Category', 'Item']).stack().groupby('Category')
.agg(rnm_cols.keys()).rename(columns=rnm_cols)
Size Sum Mean Std
Category
Books 3 58 19.333333 2.081666
Clothes 3 148 49.333333 4.041452
Technology 6 1800 300.000000 70.710678
option 1
use agg
← link to docs
agg_funcs = dict(Size='size', Sum='sum', Mean='mean', Std='std')
df.set_index(['Category', 'Item']).stack().groupby(level=0).agg(agg_funcs)
Std Sum Mean Size
Category
Books 2.081666 58 19.333333 3
Clothes 4.041452 148 49.333333 3
Technology 70.710678 1800 300.000000 6
option 2
more for less
use describe
← link to docs
df.set_index(['Category', 'Item']).stack().groupby(level=0).describe().unstack()
count mean std min 25% 50% 75% max
Category
Books 3.0 19.333333 2.081666 17.0 18.5 20.0 20.5 21.0
Clothes 3.0 49.333333 4.041452 45.0 47.5 50.0 51.5 53.0
Technology 6.0 300.000000 70.710678 200.0 262.5 300.0 337.5 400.0