How to sum values across groups without summing duplicates


I have the following df:

     A    B        C       D 
0  foo    a     1200     300  
0  foo    a      700     300  
0  foo    b     1000     300         
1  bar    b      270      70 
1  bar    a      350      70
2  abc    c      270     300 
2  abc    a      350     300

I want to sum the values in column D grouped by column B, but within each group of B the D value for a given value in column A should be counted only once. Column D has only one value per value in column A.

foo will only ever have the value 300 and bar will only have the value 70 in column D. The values in this column are just repeated because I have repeated indexes.

I want to print something like (no need to show formatting, I just need to output the correct sums):

a: 300 (from foo) + 70 (from bar) + 300 (from abc) = 670
b: 300 (from foo) + 70 (from bar) = 370
c: 300 (from abc) = 300

That is, values in column D should not be summed together if the value in column A is the same among them.

Asked By: Luiz Scheuer



For your original example, you could use pd.unique() on column D within each group of B and then sum those unique values:

df.groupby('B')['D'].apply(lambda x: sum(pd.unique(x)))
a    370
b    370
Name: D, dtype: int64
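As a hedged illustration of where this one-step approach stops working, the sketch below rebuilds the frame shown in the question by hand (the construction is an assumption about how the data was created) and runs the same expression on it. Because foo and abc both carry 300 in column D, pd.unique() collapses them into a single 300 within group a.

```python
import pandas as pd

# The frame from the question, rebuilt by hand (the index repeats per value of A).
df = pd.DataFrame(
    {
        "A": ["foo", "foo", "foo", "bar", "bar", "abc", "abc"],
        "B": ["a", "a", "b", "b", "a", "c", "a"],
        "C": [1200, 700, 1000, 270, 350, 270, 350],
        "D": [300, 300, 300, 70, 70, 300, 300],
    },
    index=[0, 0, 0, 1, 1, 2, 2],
)

# Unique D values per B: the 300 from foo and the 300 from abc are
# indistinguishable here, so group "a" gets 300 + 70 instead of 300 + 70 + 300.
per_b_unique = df.groupby("B")["D"].apply(lambda x: sum(pd.unique(x)))
print(per_b_unique)
# a    370
# b    370
# c    300
```

So deduplicating on D alone is only safe when no two values in A share the same value in D.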

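For your new example, where different values in A can share the same value in D (foo and abc both have 300), group by both B and A first so that each A contributes once per B, then sum per B: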

df.groupby(['B','A'])['D'].apply(lambda x: sum(pd.unique(x))).groupby('B').sum()


a    670
b    370
c    300
Name: D, dtype: int64
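An equivalent way to get the same totals, offered as a sketch rather than as part of the approach above: keep one row per (A, B) pair with drop_duplicates() and then sum D per B. It relies on the question's statement that D is constant for each value in A; the hand-built frame is an assumption about how the data was created.

```python
import pandas as pd

# The frame from the question, rebuilt by hand (the index repeats per value of A).
df = pd.DataFrame(
    {
        "A": ["foo", "foo", "foo", "bar", "bar", "abc", "abc"],
        "B": ["a", "a", "b", "b", "a", "c", "a"],
        "C": [1200, 700, 1000, 270, 350, 270, 350],
        "D": [300, 300, 300, 70, 70, 300, 300],
    },
    index=[0, 0, 0, 1, 1, 2, 2],
)

# Keep the first row of every (A, B) pair, then sum D within each B.
totals = df.drop_duplicates(subset=["A", "B"]).groupby("B")["D"].sum()
print(totals)
# a    670
# b    370
# c    300

# The two-step groupby, written with an explicit index level, for comparison.
two_step = (
    df.groupby(["B", "A"])["D"]
    .apply(lambda x: sum(pd.unique(x)))
    .groupby(level="B")
    .sum()
)
```

Both expressions give the same result on this data; drop_duplicates() avoids the Python-level lambda, which may matter on larger frames.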
Answered By: Rabinzel