How to create a cumulative list of values, by group, in a Pandas dataframe?
Question:
I’m trying to add a new column to a DataFrame that consists of a cumulative list (by group) of another column.
For example:
df = pd.DataFrame(data={'group1': [1, 1, 2, 2, 2], 'value': [1, 2, 3, 4, 5]})
Expected output:
   group1  value cumsum_column
0       1      1           [1]
1       1      2        [1, 2]
2       2      3           [3]
3       2      4        [3, 4]
4       2      5     [3, 4, 5]
What is the best way to accomplish this?
One way I’ve tried that doesn’t work:
df['value_list'] = [[i] for i in df['value']]
df['cumsum_column'] = df.groupby('group1')['value_list'].cumsum()
This throws the error:
TypeError: cumsum is not supported for object dtype
EDIT:
To be clearer: I want to understand why this does not work, and I’m looking for the fastest way to do it, since I plan to use it on large DataFrames.
Answers:
Use GroupBy.transform with a lambda function:
f = lambda x: [list(x)[:i] for i, y in enumerate(x, 1)]
df['cumsum_column'] = df.groupby('group1')['value'].transform(f)
print(df)
   group1  value cumsum_column
0       1      1           [1]
1       1      2        [1, 2]
2       2      3           [3]
3       2      4        [3, 4]
4       2      5     [3, 4, 5]
Thank you @mozway for the improved solution:
g = df.groupby('group1')['value']
d = g.agg(list)
df['cumsum_column'] = [d[k][:i] for k, grp in g for i, x in enumerate(grp, 1)]
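One caveat with the list comprehension above: it builds its result group by group, in sorted group order, and the assignment is positional. That matches the row order only when each group's rows are contiguous and the groups already appear in sorted order, as in the example. The following is a minimal sketch of a variant that keeps track of the row labels instead; the helper name cumulative_lists_aligned is hypothetical, and it assumes the DataFrame index is unique.

```python
import pandas as pd

def cumulative_lists_aligned(df, group_col, value_col):
    """Cumulative lists per group, returned as a Series aligned to df.index.

    Hypothetical helper; assumes df.index has no duplicate labels.
    """
    index, lists = [], []
    for _, grp in df.groupby(group_col)[value_col]:
        vals = grp.tolist()
        index.extend(grp.index)  # remember which rows these lists belong to
        lists.extend(vals[:i] for i in range(1, len(vals) + 1))
    # Build the Series in group order, then restore the original row order.
    return pd.Series(lists, index=index, dtype=object).reindex(df.index)

# The rows of group 2 are split by a row of group 1.
demo = pd.DataFrame({'group1': [2, 1, 2], 'value': [10, 20, 30]})
demo['cumsum_column'] = cumulative_lists_aligned(demo, 'group1', 'value')
print(demo)
```

If the data is already sorted by group, the plain list comprehension is simpler and this extra bookkeeping is not needed.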
Regarding "I’m looking to find out why this is not working":
In my opinion, the pandas developers wanted a performant GroupBy.cumsum, so it works only with numeric data.
Your approach does work with Series.cumsum:
df['value_list'] = [[i] for i in df['value']]
df['cumsum_column'] = df.groupby('group1')['value_list'].transform(lambda x: x.cumsum())
print(df)
   group1  value value_list cumsum_column
0       1      1        [1]           [1]
1       1      2        [2]        [1, 2]
2       2      3        [3]           [3]
3       2      4        [4]        [3, 4]
4       2      5        [5]     [3, 4, 5]
You can use a custom function in groupby.transform:
def accumulate(s):
    out = [[]]
    for x in s:
        out.append(out[-1]+[x])
    return out[1:]
df['cumsum_column'] = df.groupby('group1')['value'].transform(accumulate)
Output:
   group1  value cumsum_column
0       1      1           [1]
1       1      2        [1, 2]
2       2      3           [3]
3       2      4        [3, 4]
4       2      5     [3, 4, 5]
Why did your solution fail?
Because groupby.cumsum is restricted to numeric data (hence the "TypeError: cumsum is not supported for object dtype" error).
You would have needed to use a lambda in transform:
df.groupby('group1')['value_list'].transform(lambda x: x.cumsum())
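A cumulative sum over lists is just repeated +, which for Python lists means concatenation. A minimal sketch of that idea in plain Python, using itertools.accumulate from the standard library rather than pandas (the helper name cumulative_lists is hypothetical and not part of the answers):

```python
from itertools import accumulate as running
from operator import add

def cumulative_lists(values):
    """Return [[v0], [v0, v1], ...] by repeatedly concatenating one-element lists."""
    singletons = ([v] for v in values)
    # accumulate applies `add` to the running result and the next item;
    # for lists, `add` is concatenation and always builds a new list.
    return list(running(singletons, add))

print(cumulative_lists([3, 4, 5]))  # [[3], [3, 4], [3, 4, 5]]
```

Whichever approach is used, every row stores its own copy of the prefix, so a group of n rows holds 1 + 2 + ... + n = n(n + 1)/2 list elements in total. On big DataFrames with large groups, memory may become the limiting factor before speed does.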
Timings:
Tested on 100k rows with 100 groups.
%%timeit
df['cumsum_column'] = df.groupby('group1')['value'].transform(accumulate)
# 199 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['value_list'] = [[i] for i in df['value']]
df['cumsum_column'] = df.groupby('group1')['value_list'].transform(lambda x: x.cumsum())
# 207 ms ± 7.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
f = lambda x: [list(x)[:i] for i, y in enumerate(x, 1)]
df['cumsum_column'] = df.groupby('group1')['value'].transform(f)
# 6.65 s ± 483 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
### fix of the logic of the other solution to run faster
%%timeit
g = df.groupby('group1')['value']
d = g.agg(list)
df['cumsum_column'] = [d[k][:i] for k, grp in g for i, x in enumerate(grp, start=1)]
# 207 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
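The timings above do not show how the test DataFrame was built. The sketch below is one way to set up a comparable frame and time a solution outside a notebook. The layout (equal-sized, contiguous groups), the helper name make_benchmark_frame, and the use of timeit are all assumptions, not the original benchmark. The demo runs on a small frame so it finishes quickly; pass n_rows=100_000 and n_groups=100 to match the sizes quoted above.

```python
import timeit

import numpy as np
import pandas as pd

def make_benchmark_frame(n_rows=100_000, n_groups=100):
    """Hypothetical test frame: equal-sized groups stored contiguously."""
    rows_per_group = n_rows // n_groups
    group = np.repeat(np.arange(n_groups), rows_per_group)
    return pd.DataFrame({'group1': group, 'value': np.arange(len(group))})

def accumulate(s):
    # Same function as in the answer above.
    out = [[]]
    for x in s:
        out.append(out[-1] + [x])
    return out[1:]

# Small frame for a quick run.
small = make_benchmark_frame(n_rows=1_000, n_groups=10)
runs = 3
elapsed = timeit.timeit(
    lambda: small.groupby('group1')['value'].transform(accumulate),
    number=runs,
)
print(f"{elapsed / runs:.4f} s per run")
```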
I like:
def fun(ser):
    return [ser.head(i+1).to_list() for i in range(len(ser))]
df['cumsum_column'] = df.groupby('group1').value.transform(fun)
Or even (fully in pandas):
def fun2(ser):
    return (
        pd
        .Series(range(len(ser)))
        .add(1)
        .map(lambda i: ser.head(i).to_list())
        .set_axis(ser.index)
    )
df['cumsum_column'] = df.groupby('group1').value.transform(fun2)