Cumulative sum with missing categories in pandas
Question:
Suppose I have the following dataset
import pandas as pd

df_dict = {'unit': [1, 1, 1, 2, 2, 2], 'cat': [1, 2, 3, 1, 2, 4],
           'count': [8, 3, 2, 2, 8, 7]}
df = pd.DataFrame(df_dict)
df.set_index('unit', inplace=True)
Which looks like this:
cat count
unit
1 1 8
1 2 3
1 3 2
2 1 2
2 2 8
2 4 7
The count gives the frequency with which different categories were observed in a unit.
What I’d like to get is the cumulative frequency of the four categories for each unit. Note that category 4 is missing from unit 1 and category 3 is missing from unit 2.
Thus, the end result would be
for unit 1:
[8/13, 11/13, 13/13, 13/13]
and for unit 2:
[2/17, 10/17, 10/17, 17/17]
I know how to get the cumulative sum with groupby and cumsum, but then unit 1, for example, doesn’t have a value for the missing category 4.
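For reference, a minimal sketch of the groupby/cumsum approach described above, showing where it falls short (the column name `cumsum` is just illustrative):

```python
import pandas as pd

df = pd.DataFrame({'unit': [1, 1, 1, 2, 2, 2],
                   'cat':  [1, 2, 3, 1, 2, 4],
                   'count': [8, 3, 2, 2, 8, 7]}).set_index('unit')

# Per-unit running total: this only produces a value for the
# (unit, cat) pairs that actually appear, so unit 1 never gets
# an entry for category 4.
partial = df.assign(cumsum=df.groupby(level=0)['count'].cumsum())
```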
Thanks for your time!
Answers:
import pandas as pd
df_dict = {'unit': [1, 1, 1, 2, 2, 2], 'cat': [1, 2, 3, 1, 2, 4],
           'count': [8, 3, 2, 2, 8, 7]}
df = pd.DataFrame(df_dict)
df.set_index('unit', inplace=True)
cumsum_count = df.groupby(level=0).apply(lambda x: pd.Series(x['count'].cumsum().values, index=x['cat']))
# unit cat
# 1 1 8
# 2 11
# 3 13
# 2 1 2
# 2 10
# 4 17
# dtype: int64
cumsum_count = cumsum_count.unstack(level=1).ffill(axis=1)
# cat 1 2 3 4
# unit
# 1 8 11 13 13
# 2 2 10 10 17
totals = df.groupby(level=0)['count'].sum()
# unit
# 1 13
# 2 17
# Name: count, dtype: int64
cumsum_dist = cumsum_count.div(totals, axis=0)
print(cumsum_dist)
yields
cat 1 2 3 4
unit
1 0.615385 0.846154 1.000000 1
2 0.117647 0.588235 0.588235 1
I really don’t know how to explain this solution, probably because I arrived at it somewhat serendipitously. Inspiration came from Jeff’s solution, which used s.apply(lambda x: pd.Series(1, index=x)) to associate values with an index. Once you’ve associated the cumulative counts (values), e.g. [8, 11, 13], with the cat numbers (index), e.g. [1, 2, 3], you are basically home free. The rest is just standard applications of unstack, ffill, div and groupby.
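An alternative sketch of the same idea: instead of building the index with apply, pivot the data into a unit × cat grid first, so the missing (unit, cat) pairs appear explicitly as zeros, and then take the cumulative sum across the category axis. This avoids the intermediate MultiIndex Series entirely.

```python
import pandas as pd

df = pd.DataFrame({'unit': [1, 1, 1, 2, 2, 2],
                   'cat':  [1, 2, 3, 1, 2, 4],
                   'count': [8, 3, 2, 2, 8, 7]})

# Pivot to a unit x cat grid; combinations absent from the data
# (unit 1 / cat 4, unit 2 / cat 3) become 0 via fill_value.
grid = df.pivot_table(index='unit', columns='cat', values='count',
                      aggfunc='sum', fill_value=0)

# Cumulative sum across categories, then normalize each row by
# its per-unit total to get the cumulative relative frequency.
cumsum_dist = grid.cumsum(axis=1).div(grid.sum(axis=1), axis=0)
```

Because the zeros are in place before the cumsum, no forward-filling step is needed.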