Applying custom functions to groupby objects in pandas

Question:

I have the following pandas dataframe.

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "bird_type": ["falcon", "crane", "crane", "falcon"],
        "avg_speed": [np.random.randint(50, 200) for _ in range(4)],
        "no_of_birds_observed": [np.random.randint(3, 10) for _ in range(4)],
        "reliability_of_data": [np.random.rand() for _ in range(4)],
    }
)

# The dataframe looks like this. 
   bird_type    avg_speed   no_of_birds_observed    reliability_of_data
0   falcon        66            3                       0.553841
1   crane         159           8                       0.472359
2   crane         158           7                       0.493193
3   falcon        161           7                       0.585865

Now, I would like to compute the weighted average (weighted by no_of_birds_observed) of the avg_speed and reliability_of_data columns for each bird_type. For that, I have a simple function that calculates a weighted average:

def func(data, numbers):
    ans = 0
    for a, b in zip(data, numbers):
        ans = ans + a*b
    ans = ans / sum(numbers)
    return ans

How can I apply func to both the avg_speed and reliability_of_data columns?

I expect the answer to be a dataframe like the following:

    bird_type   avg_speed        no_of_birds_observed  reliability_of_data
0   falcon      132.5                 10                   0.5762578
# how       (66*3 + 161*7)/(3+7)    (3+7)     (0.553841*3 + 0.585865*7)/(3+7)
1   crane       158.53                15                   0.4820815
# how      (159*8 + 158*7)/(8+7)    (8+7)     (0.472359*8 + 0.493193*7)/(8+7)
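
The hand calculations in the "# how" rows can be checked with a few lines of plain Python. This is only a sketch using the sample values printed above; the helper name weighted_mean is hypothetical and not part of pandas.

```python
def weighted_mean(values, weights):
    """Return sum(v * w) / sum(w) for paired values and weights."""
    total = sum(v * w for v, w in zip(values, weights))
    return total / sum(weights)


# Sample rows from the dataframe shown above, split by bird_type.
falcon_speed = weighted_mean([66, 161], [3, 7])
crane_speed = weighted_mean([159, 158], [8, 7])
falcon_reliability = weighted_mean([0.553841, 0.585865], [3, 7])
crane_reliability = weighted_mean([0.472359, 0.493193], [8, 7])

print(falcon_speed, crane_speed)
print(falcon_reliability, crane_reliability)
```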

I saw this question but could not fully understand or generalize its solution. I considered not asking, but based on this blog post by SO and this meta question, I think a question with a different example can be considered a "borderline duplicate". An answer will benefit me, and others will probably find it useful too, so I finally decided to ask.

Asked By: berinaniesh

||

Answers:

To aggregate with GroupBy.agg, pass no_of_birds_observed as the weights parameter of np.average, selecting the rows that belong to each group with DataFrame.loc:

# for correct output, the index must be the default one (or at least have unique values)
df = df.reset_index(drop=True)


# weighted average of one group's values; the weights are looked up by the group's index labels
f = lambda x: np.average(x, weights=df.loc[x.index, 'no_of_birds_observed'])
df1 = (df.groupby('bird_type', sort=False, as_index=False)
         .agg(avg=('avg_speed', f),
              no_of_birds=('no_of_birds_observed', 'sum'),
              reliability_of_data=('reliability_of_data', f)))
print(df1)
  bird_type         avg  no_of_birds  reliability_of_data
0    falcon  132.500000           10             0.576258
1     crane  158.533333           15             0.482082
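
The same pattern should also work with the question's own func in place of np.average, since both compute sum(a*b)/sum(b). A minimal sketch, assuming the sample values printed in the question and a default RangeIndex; the wrapper name weighted is hypothetical:

```python
import numpy as np
import pandas as pd


def func(data, numbers):
    # The question's weighted average: sum(a * b) / sum(b).
    ans = 0
    for a, b in zip(data, numbers):
        ans = ans + a * b
    return ans / sum(numbers)


# Fixed sample values taken from the dataframe printed in the question.
df = pd.DataFrame(
    {
        "bird_type": ["falcon", "crane", "crane", "falcon"],
        "avg_speed": [66, 159, 158, 161],
        "no_of_birds_observed": [3, 8, 7, 7],
        "reliability_of_data": [0.553841, 0.472359, 0.493193, 0.585865],
    }
)


def weighted(x):
    # x is one group's column; its index labels select the matching weights.
    return func(x, df.loc[x.index, "no_of_birds_observed"])


out = df.groupby("bird_type", sort=False, as_index=False).agg(
    avg_speed=("avg_speed", weighted),
    no_of_birds_observed=("no_of_birds_observed", "sum"),
    reliability_of_data=("reliability_of_data", weighted),
)
print(out)

# Cross-check one group against np.average.
falcon = df[df["bird_type"] == "falcon"]
check = np.average(falcon["avg_speed"], weights=falcon["no_of_birds_observed"])
print(np.isclose(out.loc[0, "avg_speed"], check))
```

As with the lambda version, the lookup by x.index relies on the index labels being unique.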
Answered By: jezrael

Don’t use a function with apply; instead, perform a classical aggregation:

cols = ['avg_speed', 'reliability_of_data']

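# multiply the relevant columns by no_of_birds_observed,
# restore the remaining columns (bird_type, no_of_birds_observed) from df,
# then aggregate everything as a sum per bird_type
out = (df[cols].mul(df['no_of_birds_observed'], axis=0)
       .combine_first(df)
       .groupby('bird_type').sum()
      )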

# divide the relevant columns by the sum of no_of_birds_observed
out[cols] = out[cols].div(out['no_of_birds_observed'], axis=0)

Output:

            avg_speed  no_of_birds_observed  reliability_of_data
bird_type                                                       
crane      158.533333                    15             0.482082
falcon     132.500000                    10             0.576258
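
To see what each step contributes, the chain can be split into named intermediates. This is a sketch, assuming the sample values printed in the question; the names weighted, filled and sums are illustrative only:

```python
import pandas as pd

# Fixed sample values taken from the dataframe printed in the question.
df = pd.DataFrame(
    {
        "bird_type": ["falcon", "crane", "crane", "falcon"],
        "avg_speed": [66, 159, 158, 161],
        "no_of_birds_observed": [3, 8, 7, 7],
        "reliability_of_data": [0.553841, 0.472359, 0.493193, 0.585865],
    }
)
cols = ["avg_speed", "reliability_of_data"]

# Step 1: value * weight for each row (only the two value columns remain).
weighted = df[cols].mul(df["no_of_birds_observed"], axis=0)

# Step 2: combine_first adds back the columns missing from `weighted`
# (bird_type and no_of_birds_observed), taking them from df.
filled = weighted.combine_first(df)

# Step 3: per-group sums of the weighted values and of the weights.
sums = filled.groupby("bird_type").sum()

# Step 4: weighted average = sum(value * weight) / sum(weight).
out = sums.copy()
out[cols] = sums[cols].div(sums["no_of_birds_observed"], axis=0)
print(out)
```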
Answered By: mozway