Get count_values(normalize=True) result for each value, in each group in pandas GroupByDataframe

Question:

Say I have a dataframe

import pandas as pd
df = pd.DataFrame({"id":[1,1,1,1,2,2,2],
                  "car":["Volvo","Audi","Volvo","Volvo","VW","Audi","Volvo"]})

and I want to get the count-ratio for each car within each id aswell as the count i.e the resulting dataframe would be


          ratio    count
    car
id 

1   Audi   0.25     1
    Volvo  0.75     3

2   Audi   0.33     1
    Volvo  0.33     1
    VW     0.33     1

I have tried


def get_tag_ratio(x):
    """
    Calculate the ratio for each car
    """
    ratio = x.value_counts(normalize=True)
    return ratio

car_info = df.groupby(["id"])["car"].agg(
    ratio=get_tag_ratio, count="count")

but the result is

          ratio    count
 
id 

1   [0.75,0.25]          4

2   [0.33,0.33,0.33]     3

I have also tried

car_info = df.groupby(["id","car"])["car"].agg(
    ratio=get_tag_ratio, count="count")

which gives

          ratio    count
    
id   car

1   Audi   1.0      1
    Volvo  1.0      3

2   Audi   1.0      1
    Volvo  1.0      1
    VW     1.0      1

which is almost there – unless the value_counts now is applied within each car group and not each id group.

Just parsing "value_counts" doens’t work (also, I don’t know how to get "normalize=True" as an argument to value_counts here)

df.groupby("id")["car"].agg(["value_counts","count"]) #`ValueError: Must pass non-zero number of levels/codes`

Is there a way to just to it in the .agg call, or do I have to join another dataframe with the value_counts? I have quite a lot of id and car so the groupby operations a rather timeconsuming, thus I prefer only having to do one "call" of groupby.agg

Asked By: CutePoison

||

Answers:

Use Series.value_counts without parameter and with parameter normalize with DataFrame cosntructor:

def get_tag_ratio(x):
    """
    Calculate the ratio for each car
    """
    ratio = pd.DataFrame({'ratio': x.value_counts(normalize=True), 
                          'count': x.value_counts()})
    return ratio

car_info = df.groupby(["id"])["car"].apply(get_tag_ratio)
print (car_info)
             ratio  count
id                       
1  Volvo  0.750000      3
   Audi   0.250000      1
2  Audi   0.333333      1
   VW     0.333333      1
   Volvo  0.333333      1

Or join 2 MultiIndex Series:

g = df.groupby("id")["car"]
car_info = pd.concat([g.value_counts(normalize=True),
                      g.value_counts()], axis=1, keys=['ratio','count'])
print (car_info)
             ratio  count
id car                   
1  Volvo  0.750000      3
   Audi   0.250000      1
2  Audi   0.333333      1
   VW     0.333333      1
   Volvo  0.333333      1
Answered By: jezrael

One option with a double groupby.value_counts:

g = df.groupby('id')['car']
out = (g
 .value_counts().to_frame(name='count')
 .assign(ratio=g.value_counts(normalize=True))
)

Alternative, using a manual calculation:

s = df.groupby(['id', 'car']).size()
out = pd.concat({'count': s,
                 'ratio': s.div(s.groupby(level='id').transform('sum'))},
                axis=1)

output:

          count     ratio
id car                   
1  Volvo      3  0.750000
   Audi       1  0.250000
2  Audi       1  0.333333
   VW         1  0.333333
   Volvo      1  0.333333
Answered By: mozway
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.