How can I use cumsum within a group in Pandas?

Question:

I have

import pandas as pd
df = pd.DataFrame.from_dict({'id': ['A', 'B', 'A', 'C', 'D', 'B', 'C'], 'val': [1,2,-3,1,5,6,-2], 'stuff':['12','23232','13','1234','3235','3236','732323']})

  id   stuff  val
0  A      12    1
1  B   23232    2
2  A      13   -3
3  C    1234    1
4  D    3235    5
5  B    3236    6
6  C  732323   -2

I’d like to get a running sum of val for each id, so the desired output looks like this:

  id   stuff  val  cumsum
0  A      12    1       1
1  B   23232    2       2
2  A      13   -3      -2
3  C    1234    1       1
4  D    3235    5       5
5  B    3236    6       8
6  C  732323   -2      -1

This is what I tried:

df['cumsum'] = df.groupby('id').cumsum(['val'])

This is the error I get:

ValueError: Wrong number of items passed 0, placement implies 1
Asked By: Baron Yugovich


Answers:

You can call transform and pass the cumsum function to add that column to your df:

In [156]:
df['cumsum'] = df.groupby('id')['val'].transform(pd.Series.cumsum)
df

Out[156]:
  id   stuff  val  cumsum
0  A      12    1       1
1  B   23232    2       2
2  A      13   -3      -2
3  C    1234    1       1
4  D    3235    5       5
5  B    3236    6       8
6  C  732323   -2      -1

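With respect to your error: first, you called cumsum on the groupby of the whole DataFrame rather than on the 'val' column alone, so the result is not a single Series that can be assigned to one column. Secondly, you’re passing the column name as a list to cumsum, which is meaningless because cumsum does not take column names as an argument. Select the column on the groupby object first.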

So this works:

In [159]:
df.groupby('id')['val'].cumsum()

Out[159]:
0    1
1    2
2   -2
3    1
4    5
5    8
6   -1
dtype: int64
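
To tie the pieces together, here is a minimal, self-contained sketch using the data from the question. It selects 'val' on the groupby object, takes the cumulative sum, and assigns the resulting Series back as a new column. The variable name running is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['A', 'B', 'A', 'C', 'D', 'B', 'C'],
    'val': [1, 2, -3, 1, 5, 6, -2],
    'stuff': ['12', '23232', '13', '1234', '3235', '3236', '732323'],
})

# Select the single column first, so the result is a Series
# that shares its index with df (one value per original row).
running = df.groupby('id')['val'].cumsum()

# Because the index matches, the Series can be assigned as a new column.
df['cumsum'] = running

print(df)
```

Since the result keeps the original row order and index, no merge or sort is needed before the assignment.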
Answered By: EdChum

cumsum is one of those functions (like cumprod, rank, etc.) that return a Series or DataFrame indexed the same as the original DataFrame, so every way of passing a function to groupby works and produces the same output.

All of the following are equivalent.

x = df.groupby('id')['val'].agg('cumsum')
y = df.groupby('id')['val'].apply('cumsum')
z = df.groupby('id')['val'].cumsum()
w = df.groupby('id')['val'].transform('cumsum')

all(x.equals(d) for d in [y, z, w]) # True

Also, df.groupby('id').cumsum() computes the cumulative sum of every other column in df, grouped by 'id'.
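
As a rough illustration of the multi-column case, the sketch below uses a small made-up frame (df_num, with a hypothetical second numeric column named other) and selects the numeric columns explicitly, which sidesteps any question of how non-numeric columns such as 'stuff' are handled.

```python
import pandas as pd

# Hypothetical example data: two numeric columns plus the grouping key.
df_num = pd.DataFrame({
    'id': ['A', 'B', 'A', 'B'],
    'val': [1, 2, -3, 6],
    'other': [10, 20, 30, 40],
})

# Selecting a list of columns yields a DataFrame result,
# with one running sum per selected column within each 'id' group.
per_group = df_num.groupby('id')[['val', 'other']].cumsum()

print(per_group)
```

The result has the same index as df_num, so its columns can be joined or assigned back in the same way as the single-column case.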

Answered By: cottontail