groupby weighted average and sum in pandas dataframe
Question:
I have a dataframe:
Out[78]:
contract month year buys adjusted_lots price
0 W Z 5 Sell -5 554.85
1 C Z 5 Sell -3 424.50
2 C Z 5 Sell -2 424.00
3 C Z 5 Sell -2 423.75
4 C Z 5 Sell -3 423.50
5 C Z 5 Sell -2 425.50
6 C Z 5 Sell -3 425.25
7 C Z 5 Sell -2 426.00
8 C Z 5 Sell -2 426.75
9 CC U 5 Buy 5 3328.00
10 SB V 5 Buy 5 11.65
11 SB V 5 Buy 5 11.64
12 SB V 5 Buy 2 11.60
I need a sum of adjusted_lots , price which is weighted average , of price and adjusted_lots , grouped by all the other columns , ie. grouped by (contract, month , year and buys)
Similar solution on R was achieved by following code, using dplyr, however unable to do the same in pandas.
> newdf = df %>%
select ( contract , month , year , buys , adjusted_lots , price ) %>%
group_by( contract , month , year , buys) %>%
summarise(qty = sum( adjusted_lots) , avgpx = weighted.mean(x = price , w = adjusted_lots) , comdty = "Comdty" )
> newdf
Source: local data frame [4 x 6]
contract month year comdty qty avgpx
1 C Z 5 Comdty -19 424.8289
2 CC U 5 Comdty 5 3328.0000
3 SB V 5 Comdty 12 11.6375
4 W Z 5 Comdty -5 554.8500
is the same possible by groupby or any other solution ?
Answers:
EDIT: update aggregation so it works with recent version of pandas
To pass multiple functions to a groupby object, you need to pass a tuples with the aggregation functions and the column to which the function applies:
# Define a lambda function to compute the weighted mean:
wm = lambda x: np.average(x, weights=df.loc[x.index, "adjusted_lots"])
# Define a dictionary with the functions to apply for a given column:
# the following is deprecated since pandas 0.20:
# f = {'adjusted_lots': ['sum'], 'price': {'weighted_mean' : wm} }
# df.groupby(["contract", "month", "year", "buys"]).agg(f)
# Groupby and aggregate with namedAgg [1]:
df.groupby(["contract", "month", "year", "buys"]).agg(adjusted_lots=("adjusted_lots", "sum"),
price_weighted_mean=("price", wm))
adjusted_lots price_weighted_mean
contract month year buys
C Z 5 Sell -19 424.828947
CC U 5 Buy 5 3328.000000
SB V 5 Buy 12 11.637500
W Z 5 Sell -5 554.850000
You can see more here:
and in a similar question here:
Doing weighted average by groupby(…).apply(…) can be very slow (100x from the following).
See my answer (and others) on this thread.
def weighted_average(df,data_col,weight_col,by_col):
df['_data_times_weight'] = df[data_col]*df[weight_col]
df['_weight_where_notnull'] = df[weight_col]*pd.notnull(df[data_col])
g = df.groupby(by_col)
result = g['_data_times_weight'].sum() / g['_weight_where_notnull'].sum()
del df['_data_times_weight'], df['_weight_where_notnull']
return result
The solution that uses a dict of aggregation functions will be deprecated in a future version of pandas (version 0.22):
FutureWarning: using a dict with renaming is deprecated and will be removed in a future
version return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
Use a groupby apply and return a Series to rename columns as discussed in:
Rename result columns from Pandas aggregation ("FutureWarning: using a dict with renaming is deprecated")
def my_agg(x):
names = {'weighted_ave_price': (x['adjusted_lots'] * x['price']).sum()/x['adjusted_lots'].sum()}
return pd.Series(names, index=['weighted_ave_price'])
produces the same result:
>df.groupby(["contract", "month", "year", "buys"]).apply(my_agg)
weighted_ave_price
contract month year buys
C Z 5 Sell 424.828947
CC U 5 Buy 3328.000000
SB V 5 Buy 11.637500
W Z 5 Sell 554.850000
Wouldn’t it be a lot more simpler to do this.
- Multiply (adjusted_lots * price_weighted_mean) into a new column "X"
- Use groupby().sum() for columns "X" and "adjusted_lots" to get
grouped df df_grouped
- Compute weighted average on the df_grouped as
df_grouped[‘X’]/df_grouped[‘adjusted_lots’]
With datar
, you don’t have to learn pandas APIs to transition your R code:
>>> from datar.all import f, tibble, c, rep, select, summarise, sum, weighted_mean, group_by
>>> df = tibble(
... contract=c('W', rep('C', 8), 'CC', rep('SB', 3)),
... month=c(rep('Z', 9), 'U', rep('V', 3)),
... year=5,
... buys=c(rep('Sell', 9), rep('Buy', 4)),
... adjusted_lots=[-5, -3, -2, -2, -3, -2, -3, -2, -2, 5, 5, 5, 2],
... price=[554.85, 424.50, 424.00, 423.75, 423.50, 425.50, 425.25, 426.00, 426.75,3328.00, 11.65, 11.64, 1
1.60]
... )
>>> df
contract month year buys adjusted_lots price
0 W Z 5 Sell -5 554.85
1 C Z 5 Sell -3 424.50
2 C Z 5 Sell -2 424.00
3 C Z 5 Sell -2 423.75
4 C Z 5 Sell -3 423.50
5 C Z 5 Sell -2 425.50
6 C Z 5 Sell -3 425.25
7 C Z 5 Sell -2 426.00
8 C Z 5 Sell -2 426.75
9 CC U 5 Buy 5 3328.00
10 SB V 5 Buy 5 11.65
11 SB V 5 Buy 5 11.64
12 SB V 5 Buy 2 11.60
>>> newdf = df >>
... select(f.contract, f.month, f.year, f.buys, f.adjusted_lots, f.price) >>
... group_by(f.contract, f.month, f.year, f.buys) >>
... summarise(
... qty = sum(f.adjusted_lots),
... avgpx = weighted_mean(x = f.price , w = f.adjusted_lots),
... comdty = "Comdty"
... )
[2021-05-24 13:11:03][datar][ INFO] `summarise()` has grouped output by ['contract', 'month', 'year'] (overr
ide with `_groups` argument)
>>>
>>> newdf
contract month year buys qty avgpx comdty
0 C Z 5 Sell -19 424.828947 Comdty
1 CC U 5 Buy 5 3328.000000 Comdty
2 SB V 5 Buy 12 11.637500 Comdty
3 W Z 5 Sell -5 554.850000 Comdty
[Groups: ['contract', 'month', 'year'] (n=4)]
I am the author of the package. Feel free to submit issues if you have any questions.
ErnestScribbler’s answer is much faster than the accepted solution. Here a multivariate analogue:
def weighted_average(df,data_col,weight_col,by_col):
''' Now data_col can be a list of variables '''
df_data = df[data_col].multiply(df[weight_col], axis='index')
df_weight = pd.notnull(df[data_col]).multiply(df[weight_col], axis='index')
df_data[by_col] = df[by_col]
df_weight[by_col] = df[by_col]
result = df_data.groupby(by_col).sum() / df_weight.groupby(by_col).sum()
return result
I came across this thread when confronted with a similar problem. In my case, I wanted to generate a weighted metric of a quarterback rating should more than one quarterback have attempted a pass in a given NFL game.
I may change the code if I start running into significant performance issues as I scale. For now, I preferred squeezing my solution into the .agg
function alongside other transforms. Happy to see if someone has a simpler solution to achieve the same end. Ultimately, I employed a closure pattern.
The magic of the closure approach, if this is an unfamiliar pattern to a future reader, is that I can still return a simple function to pandas’ .agg()
method, but I get to do so with some additional information preconfigured from the top-level factory
function.
def weighted_mean_factory(*args, **kwargs):
weights = kwargs.get('w').copy()
def weighted_mean(x):
x_mask = ~np.isnan(x)
w = weights.loc[x.index]
if all(v is False for v in x_mask):
raise ValueError('there are no non-missing x variable values')
return np.average(x[x_mask], weights=w[x_mask])
return weighted_mean
res_df = df.groupby(['game_id', 'team'])
.agg(pass_player_cnt=('attempts', count_is_not_zero),
completions=('completions', 'sum'),
attempts=('attempts', 'sum'),
pass_yds=('pass_yards', 'sum'),
pass_tds=('pass_tds', 'sum'),
pass_int=('pass_int', 'sum'),
sack_taken=('sacks_taken', 'sum'),
sack_yds_loss=('sack_yds_loss', 'sum'),
longest_completion=('longest_completion', 'max'),
qbr_w_avg=('qb_rating', weighted_mean_factory(x='qb_rating', w=df['attempts']))
)
Some basic benchmarking stats on a DataFrame with the shape (5436, 31) are below and are not cause for concern on my end in terms of performance at this stage:
149 ms ± 4.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This combines the original approach by jrjc with the closure approach by MB. It has the advantage of being able to reuse the closure function.
import pandas as pd
def group_weighted_mean_factory(df: pd.DataFrame, weight_col_name: str):
# Ref: https://stackoverflow.com/a/69787938/
def group_weighted_mean(x):
try:
return np.average(x, weights=df.loc[x.index, weight_col_name])
except ZeroDivisionError:
return np.average(x)
return group_weighted_mean
df = ... # Define
group_weighted_mean = group_weighted_mean_factory(df, "adjusted_lots")
g = df.groupby(...) # Define
agg_df = g.agg({'price': group_weighted_mean})
I have a dataframe:
Out[78]:
contract month year buys adjusted_lots price
0 W Z 5 Sell -5 554.85
1 C Z 5 Sell -3 424.50
2 C Z 5 Sell -2 424.00
3 C Z 5 Sell -2 423.75
4 C Z 5 Sell -3 423.50
5 C Z 5 Sell -2 425.50
6 C Z 5 Sell -3 425.25
7 C Z 5 Sell -2 426.00
8 C Z 5 Sell -2 426.75
9 CC U 5 Buy 5 3328.00
10 SB V 5 Buy 5 11.65
11 SB V 5 Buy 5 11.64
12 SB V 5 Buy 2 11.60
I need a sum of adjusted_lots , price which is weighted average , of price and adjusted_lots , grouped by all the other columns , ie. grouped by (contract, month , year and buys)
Similar solution on R was achieved by following code, using dplyr, however unable to do the same in pandas.
> newdf = df %>%
select ( contract , month , year , buys , adjusted_lots , price ) %>%
group_by( contract , month , year , buys) %>%
summarise(qty = sum( adjusted_lots) , avgpx = weighted.mean(x = price , w = adjusted_lots) , comdty = "Comdty" )
> newdf
Source: local data frame [4 x 6]
contract month year comdty qty avgpx
1 C Z 5 Comdty -19 424.8289
2 CC U 5 Comdty 5 3328.0000
3 SB V 5 Comdty 12 11.6375
4 W Z 5 Comdty -5 554.8500
is the same possible by groupby or any other solution ?
EDIT: update aggregation so it works with recent version of pandas
To pass multiple functions to a groupby object, you need to pass a tuples with the aggregation functions and the column to which the function applies:
# Define a lambda function to compute the weighted mean:
wm = lambda x: np.average(x, weights=df.loc[x.index, "adjusted_lots"])
# Define a dictionary with the functions to apply for a given column:
# the following is deprecated since pandas 0.20:
# f = {'adjusted_lots': ['sum'], 'price': {'weighted_mean' : wm} }
# df.groupby(["contract", "month", "year", "buys"]).agg(f)
# Groupby and aggregate with namedAgg [1]:
df.groupby(["contract", "month", "year", "buys"]).agg(adjusted_lots=("adjusted_lots", "sum"),
price_weighted_mean=("price", wm))
adjusted_lots price_weighted_mean
contract month year buys
C Z 5 Sell -19 424.828947
CC U 5 Buy 5 3328.000000
SB V 5 Buy 12 11.637500
W Z 5 Sell -5 554.850000
You can see more here:
and in a similar question here:
Doing weighted average by groupby(…).apply(…) can be very slow (100x from the following).
See my answer (and others) on this thread.
def weighted_average(df,data_col,weight_col,by_col):
df['_data_times_weight'] = df[data_col]*df[weight_col]
df['_weight_where_notnull'] = df[weight_col]*pd.notnull(df[data_col])
g = df.groupby(by_col)
result = g['_data_times_weight'].sum() / g['_weight_where_notnull'].sum()
del df['_data_times_weight'], df['_weight_where_notnull']
return result
The solution that uses a dict of aggregation functions will be deprecated in a future version of pandas (version 0.22):
FutureWarning: using a dict with renaming is deprecated and will be removed in a future
version return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
Use a groupby apply and return a Series to rename columns as discussed in:
Rename result columns from Pandas aggregation ("FutureWarning: using a dict with renaming is deprecated")
def my_agg(x):
names = {'weighted_ave_price': (x['adjusted_lots'] * x['price']).sum()/x['adjusted_lots'].sum()}
return pd.Series(names, index=['weighted_ave_price'])
produces the same result:
>df.groupby(["contract", "month", "year", "buys"]).apply(my_agg)
weighted_ave_price
contract month year buys
C Z 5 Sell 424.828947
CC U 5 Buy 3328.000000
SB V 5 Buy 11.637500
W Z 5 Sell 554.850000
Wouldn’t it be a lot more simpler to do this.
- Multiply (adjusted_lots * price_weighted_mean) into a new column "X"
- Use groupby().sum() for columns "X" and "adjusted_lots" to get
grouped df df_grouped - Compute weighted average on the df_grouped as
df_grouped[‘X’]/df_grouped[‘adjusted_lots’]
With datar
, you don’t have to learn pandas APIs to transition your R code:
>>> from datar.all import f, tibble, c, rep, select, summarise, sum, weighted_mean, group_by
>>> df = tibble(
... contract=c('W', rep('C', 8), 'CC', rep('SB', 3)),
... month=c(rep('Z', 9), 'U', rep('V', 3)),
... year=5,
... buys=c(rep('Sell', 9), rep('Buy', 4)),
... adjusted_lots=[-5, -3, -2, -2, -3, -2, -3, -2, -2, 5, 5, 5, 2],
... price=[554.85, 424.50, 424.00, 423.75, 423.50, 425.50, 425.25, 426.00, 426.75,3328.00, 11.65, 11.64, 1
1.60]
... )
>>> df
contract month year buys adjusted_lots price
0 W Z 5 Sell -5 554.85
1 C Z 5 Sell -3 424.50
2 C Z 5 Sell -2 424.00
3 C Z 5 Sell -2 423.75
4 C Z 5 Sell -3 423.50
5 C Z 5 Sell -2 425.50
6 C Z 5 Sell -3 425.25
7 C Z 5 Sell -2 426.00
8 C Z 5 Sell -2 426.75
9 CC U 5 Buy 5 3328.00
10 SB V 5 Buy 5 11.65
11 SB V 5 Buy 5 11.64
12 SB V 5 Buy 2 11.60
>>> newdf = df >>
... select(f.contract, f.month, f.year, f.buys, f.adjusted_lots, f.price) >>
... group_by(f.contract, f.month, f.year, f.buys) >>
... summarise(
... qty = sum(f.adjusted_lots),
... avgpx = weighted_mean(x = f.price , w = f.adjusted_lots),
... comdty = "Comdty"
... )
[2021-05-24 13:11:03][datar][ INFO] `summarise()` has grouped output by ['contract', 'month', 'year'] (overr
ide with `_groups` argument)
>>>
>>> newdf
contract month year buys qty avgpx comdty
0 C Z 5 Sell -19 424.828947 Comdty
1 CC U 5 Buy 5 3328.000000 Comdty
2 SB V 5 Buy 12 11.637500 Comdty
3 W Z 5 Sell -5 554.850000 Comdty
[Groups: ['contract', 'month', 'year'] (n=4)]
I am the author of the package. Feel free to submit issues if you have any questions.
ErnestScribbler’s answer is much faster than the accepted solution. Here a multivariate analogue:
def weighted_average(df,data_col,weight_col,by_col):
''' Now data_col can be a list of variables '''
df_data = df[data_col].multiply(df[weight_col], axis='index')
df_weight = pd.notnull(df[data_col]).multiply(df[weight_col], axis='index')
df_data[by_col] = df[by_col]
df_weight[by_col] = df[by_col]
result = df_data.groupby(by_col).sum() / df_weight.groupby(by_col).sum()
return result
I came across this thread when confronted with a similar problem. In my case, I wanted to generate a weighted metric of a quarterback rating should more than one quarterback have attempted a pass in a given NFL game.
I may change the code if I start running into significant performance issues as I scale. For now, I preferred squeezing my solution into the .agg
function alongside other transforms. Happy to see if someone has a simpler solution to achieve the same end. Ultimately, I employed a closure pattern.
The magic of the closure approach, if this is an unfamiliar pattern to a future reader, is that I can still return a simple function to pandas’ .agg()
method, but I get to do so with some additional information preconfigured from the top-level factory
function.
def weighted_mean_factory(*args, **kwargs):
weights = kwargs.get('w').copy()
def weighted_mean(x):
x_mask = ~np.isnan(x)
w = weights.loc[x.index]
if all(v is False for v in x_mask):
raise ValueError('there are no non-missing x variable values')
return np.average(x[x_mask], weights=w[x_mask])
return weighted_mean
res_df = df.groupby(['game_id', 'team'])
.agg(pass_player_cnt=('attempts', count_is_not_zero),
completions=('completions', 'sum'),
attempts=('attempts', 'sum'),
pass_yds=('pass_yards', 'sum'),
pass_tds=('pass_tds', 'sum'),
pass_int=('pass_int', 'sum'),
sack_taken=('sacks_taken', 'sum'),
sack_yds_loss=('sack_yds_loss', 'sum'),
longest_completion=('longest_completion', 'max'),
qbr_w_avg=('qb_rating', weighted_mean_factory(x='qb_rating', w=df['attempts']))
)
Some basic benchmarking stats on a DataFrame with the shape (5436, 31) are below and are not cause for concern on my end in terms of performance at this stage:
149 ms ± 4.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This combines the original approach by jrjc with the closure approach by MB. It has the advantage of being able to reuse the closure function.
import pandas as pd
def group_weighted_mean_factory(df: pd.DataFrame, weight_col_name: str):
# Ref: https://stackoverflow.com/a/69787938/
def group_weighted_mean(x):
try:
return np.average(x, weights=df.loc[x.index, weight_col_name])
except ZeroDivisionError:
return np.average(x)
return group_weighted_mean
df = ... # Define
group_weighted_mean = group_weighted_mean_factory(df, "adjusted_lots")
g = df.groupby(...) # Define
agg_df = g.agg({'price': group_weighted_mean})