Pandas groupby and creating a new column based on a calculation of aggregated value

Question:

I’ve got a groupby dataframe that produces the dataframe in the picture below.
I’m trying to figure out how to create an additional column that takes the max value of the product group and then subtracts all the values from it.

For example, in the first group (_CustomerID == 1), the max of the product group is 52.
So the new column for this group would be 52 minus each value, or [0, 22, 15, 19]. For the second group, the max value is 42, so the new column values would be [0, 12, 9, 12].

I’ve been researching transform() and idxmax(), but can’t seem to put it all together.

(regions_analysis # <-- DataFrame
 .groupby(['_CustomerID', 'Region'])
 .agg(product=('_ProductID', 'count'),
      )
 )

[image: the grouped DataFrame output]

Asked By: Yogesh Riyat

Answers:

This solution uses a helper column that finds the max of each group. Then, it’s a simple subtraction. I’m sure there’s a slick one-liner but the two lines below are very useful patterns in Pandas.

import pandas as pd

x = [['1', 'Midwest', 52],
    ['1', 'Northeast', 30],
    ['1', 'South', 37],
    ['1', 'West', 33],
    ['2', 'Midwest', 42],
    ['2', 'Northeast', 30],
    ['2', 'South', 33],
    ['2', 'West', 30]]

df = pd.DataFrame(x, columns=['customer_id', 'region', 'product'])

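# Helper column: each row gets the max 'product' of its customer_id group
df['max'] = df.groupby('customer_id')['product'].transform('max')
# Difference between the group max and the row's own value
df['max_sub'] = df['max'] - df['product']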

If you don’t want the helper/max column to stay in the frame, you can overwrite it with the result (the column named 'max' then holds the difference, not the max):

df['max'] = df.groupby('customer_id')['product'].transform('max')
df['max'] = df['max'] - df['product']
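
For what it’s worth, the two steps can also be folded into a single assignment. This is only a sketch of the one-liner hinted at above, using the same sample data; it works because transform('max') returns a Series aligned to the original rows, so it can be subtracted from directly.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "customer_id": ["1", "1", "1", "1", "2", "2", "2", "2"],
        "region": ["Midwest", "Northeast", "South", "West"] * 2,
        "product": [52, 30, 37, 33, 42, 30, 33, 30],
    }
)

# Group max broadcast back to every row, minus the row's own value.
df["max_sub"] = df.groupby("customer_id")["product"].transform("max") - df["product"]

print(df["max_sub"].tolist())  # [0, 22, 15, 19, 0, 12, 9, 12]
```

No helper column is left behind this way, at the cost of a slightly denser line.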

Answered By: lummers

import pandas as pd

x = {('1', 'Midwest'): 52,
    ('1', 'Northeast'): 30,
    ('1', 'South'): 37,
    ('1', 'West'): 33,
    ('2', 'Midwest'): 42,
    ('2', 'Northeast'): 30,
    ('2', 'South'): 33,
    ('2', 'West'): 30}

df = pd.Series(x).to_frame(name='product').rename_axis(index=['customer_id', 'region'])

# group_keys=False keeps the original index on the result, so it aligns with df
df['deficit'] = df.groupby('customer_id', group_keys=False)['product'].apply(lambda s: s.max() - s)
df

Note that the groupby in the question produces a frame with a hierarchical index, ('_CustomerID', 'Region'), and product as a column. Functionally, this doesn’t affect @lummers’s answer. If you’d rather not rely on a helper column or chained steps, you can apply your own function to each group; just be sure to pass group_keys=False so that the result keeps the same index as the frame and can be assigned back as a column.
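
To illustrate the point about the hierarchical index, here is a minimal sketch of the transform approach applied directly to an index-based frame like the one above. The name counts is just for illustration; grouping with level='customer_id' is one way to group on an index level.

```python
import pandas as pd

x = {('1', 'Midwest'): 52, ('1', 'Northeast'): 30,
     ('1', 'South'): 37, ('1', 'West'): 33,
     ('2', 'Midwest'): 42, ('2', 'Northeast'): 30,
     ('2', 'South'): 33, ('2', 'West'): 30}

# Same shape as the grouped output: (customer_id, region) index, 'product' column.
counts = pd.Series(x).to_frame(name='product').rename_axis(index=['customer_id', 'region'])

# Group on the index level; transform keeps the original MultiIndex,
# so the subtraction lines up row by row.
group_max = counts.groupby(level='customer_id')['product'].transform('max')
counts['deficit'] = group_max - counts['product']

print(counts['deficit'].tolist())  # [0, 22, 15, 19, 0, 12, 9, 12]
```

Because transform always returns a result with the same index as its input, there is no group_keys setting to worry about here.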

Answered By: Spencer Fretwell