How to create a cumulative list of values, by group, in a Pandas dataframe?

Question:

I’m trying to add a new column to the DataFrame, that consists of a cumulative list (by group) of another column.

For example:

df = pd.DataFrame(data={'group1': [1, 1, 2, 2, 2], 'value': [1, 2, 3, 4, 5]})

Expected output:

   group1  value cumsum_column
0       1      1           [1]
1       1      2        [1, 2]
2       2      3           [3]
3       2      4        [3, 4]
4       2      5     [3, 4, 5]

What is the best way to accomplish this?

One way I’ve tried that doesn’t work:

df['value_list'] = [[i] for i in df['value']]
df['cumsum_column'] = df.groupby('group1')['value_list'].cumsum()

This throws the error:

TypeError: cumsum is not supported for object dtype

EDIT:
To be clearer, I’m looking to find out why this is not working + looking for the fastest way for this to happen — as I’m looking to use it on big dataframes.

Asked By: Nils Mackay

||

Answers:

Use GroupBy.transform with lambda function:

f = lambda x: [list(x)[:i] for i, y in enumerate(x, 1)]
df['cumsum_column'] = df.groupby('group1')['value'].transform(f)  
print (df)
   group1  value cumsum_column
0       1      1           [1]
1       1      2        [1, 2]
2       2      3           [3]
3       2      4        [3, 4]
4       2      5     [3, 4, 5]

Thank you @mozway for improved solution:

g = df.groupby('group1')['value']
d = g.agg(list)
df['cumsum_column'] = [d[k][:i] for k, grp in g for i, x in enumerate(grp, 1)]

I’m looking to find out why this is not working

In my opinion pandas devs want performant solution for GroupBy.cumsum working only with numeric data.


Your solution working with Series.cumsum:

df['value_list'] = [[i] for i in df['value']]
df['cumsum_column'] = df.groupby('group1')['value_list'].transform(lambda x: x.cumsum())
print (df)
   group1  value value_list cumsum_column
0       1      1        [1]           [1]
1       1      2        [2]        [1, 2]
2       2      3        [3]           [3]
3       2      4        [4]        [3, 4]
4       2      5        [5]     [3, 4, 5]
Answered By: jezrael

You can use a custom function in groupby.transform:

def accumulate(s):
    out = [[]]
    for x in s:
        out.append(out[-1]+[x])
    return out[1:]

df['cumsum_column'] = df.groupby('group1')['value'].transform(accumulate)

Output:

   group1  value cumsum_column
0       1      1           [1]
1       1      2        [1, 2]
2       2      3           [3]
3       2      4        [3, 4]
4       2      5     [3, 4, 5]

Why your solution failed?

because groupby.cumsum is restricted to numeric data (Thus the "TypeError: cumsum is not supported for object dtype" error).

You would have needed to use a lambda in transform (df.groupby('group1')['value_list'].transform(lambda x: x.cumsum()).

timings:

Tested on 100k rows with 100 groups.

%%timeit
df['cumsum_column'] = df.groupby('group1')['value'].transform(accumulate)
# 199 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df['value_list'] = [[i] for i in df['value']]
df['cumsum_column'] = df.groupby('group1')['value_list'].transform(lambda x: x.cumsum())
# 207 ms ± 7.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
f = lambda x: [list(x)[:i] for i, y in enumerate(x, 1)]
df['cumsum_column'] = df.groupby('group1')['value'].transform(f)
# 6.65 s ± 483 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

### fix of the logic of the other solution to run faster
%%timeit
g = df.groupby('group1')['value']
d = g.agg(list)
df['cumsum_column'] = [d[k][:i] for k, grp in g for i, x in enumerate(grp, start=1)]
# 207 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

enter image description here

Answered By: mozway

i like

def fun(ser):
    return [ser.head(i+1).to_list() for i in range(len(ser))]

df['cumsum_column'] = df.groupby('group1').value.transform(fun)

or even (fully in pandas)

def fun2(ser):
    return ( 
        pd
        .Series(range(len(ser)))
        .add(1)
        .map(lambda i : ser.head(i).to_list())
        .set_axis(ser.index)
    )

df['cumsum_column'] = df.groupby('group1').value.transform(fun2)
Answered By: Albert
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.