Cumulative sum with missing categories in pandas

Question:

Suppose I have the following dataset

import pandas as pd

df_dict = ({'unit' : [1, 1, 1, 2, 2, 2], 'cat' : [1, 2, 3, 1, 2, 4], 
           'count' : [8, 3, 2, 2, 8, 7] })
df = pd.DataFrame(df_dict)

df.set_index('unit', inplace = True)

Which looks like this:

      cat  count
unit            
1       1      8
1       2      3
1       3      2
2       1      2
2       2      8
2       4      7

The count gives the frequency with which the different categories were observed in a unit.
What I’d like to get is the cumulative frequency of the four categories for each unit. Note that category 4 is missing from unit 1 and category 3 is missing from unit 2.

Thus, the end result would be

for unit 1:

[8/13, 11/13, 13/13, 13/13]

and for unit 2:

[2/17, 10/17, 10/17, 17/17]

I know how to get the cumulative sum with groupby and cumsum, but then unit 1, for example, doesn’t have a value for the missing category 4.
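A minimal sketch of that groupby/cumsum attempt, for reference. It keeps unit as an ordinary column instead of the index, purely to keep the example short, and the column name cum is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'unit': [1, 1, 1, 2, 2, 2],
                   'cat': [1, 2, 3, 1, 2, 4],
                   'count': [8, 3, 2, 2, 8, 7]})

# Running total of count within each unit, in row order.
df['cum'] = df.groupby('unit')['count'].cumsum()

# Categories that actually have a row in unit 1.
cats_in_unit1 = df.loc[df['unit'] == 1, 'cat'].tolist()
print(df)
print(cats_in_unit1)
```

Unit 1 only has rows for categories 1, 2 and 3, so the running total has no entry for category 4.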

Thanks for your time!

Asked By: cd98


Answers:

import pandas as pd


df_dict = ({'unit' : [1, 1, 1, 2, 2, 2], 'cat' : [1, 2, 3, 1, 2, 4], 
           'count' : [8, 3, 2, 2, 8, 7] })
df = pd.DataFrame(df_dict)

df.set_index('unit', inplace = True)    

cumsum_count = df.groupby(level=0).apply(lambda x: pd.Series(x['count'].cumsum().values, index=x['cat']))
# unit  cat
# 1     1       8
#       2      11
#       3      13
# 2     1       2
#       2      10
#       4      17
# dtype: int64

cumsum_count = cumsum_count.unstack(level=1).ffill(axis=1)
# cat   1   2   3   4
# unit               
# 1     8  11  13  13
# 2     2  10  10  17

totals = df.groupby(level=0)['count'].sum()
# unit
# 1       13
# 2       17
# Name: count, dtype: int64

cumsum_dist = cumsum_count.div(totals, axis=0)
print(cumsum_dist)

yields

cat          1         2         3  4
unit                                 
1     0.615385  0.846154  1.000000  1
2     0.117647  0.588235  0.588235  1

I really don’t know how to explain this solution, probably because I arrived at it somewhat serendipitously. Inspiration came from Jeff’s solution, which used

s.apply(lambda x: pd.Series(1, index=x))

to associate values with an index. Once you’ve associated the cumulative counts (values), e.g. [8,11,13], with the cat numbers (index), e.g. [1,2,3], you are basically home free. The rest is just standard applications of unstack, fillna, div and groupby.
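To make the association step concrete, here is a small sketch that builds the two per-unit Series by hand and then applies the same unstack and forward fill. The names unit1, unit2, stacked, wide and filled are made up for this illustration, and pd.concat stands in for what the groupby/apply call produces:

```python
import pandas as pd

# Cumulative counts (values) keyed by the cat numbers (index).
unit1 = pd.Series([8, 11, 13], index=[1, 2, 3])
unit2 = pd.Series([2, 10, 17], index=[1, 2, 4])

# Stack both Series under a two-level (unit, cat) index.
stacked = pd.concat({1: unit1, 2: unit2})
stacked.index.names = ['unit', 'cat']

# Moving cat into the columns aligns the units on the union of
# categories; a category a unit never saw becomes NaN.
wide = stacked.unstack(level=1)

# Carry the last cumulative count forward across each row.
filled = wide.ffill(axis=1)
print(filled)
```

Because the cumulative sum is computed before the gaps appear, forward filling each row is enough: a missing category simply repeats the running total reached so far.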

Answered By: unutbu