Row-wise Cumulative Mean Across Grouped Columns using Pandas
Question:
I would like to create multiple columns which show the row-wise cumulative mean for grouped columns. Here is some sample data:
import pandas as pd
data = [[1, 4, 6, 10, 15, 40, 90, 100], [2, 5, 3, 11, 25, 50, 90, 120], [3, 7, 9, 14, 35, 55, 100, 120]]
df = pd.DataFrame(data, columns=['a1', 'a2', 'a3', 'a4', 'b1', 'b2', 'b3', 'b4'])
a1 a2 a3 a4 b1 b2 b3 b4
0 1 4 6 10 15 40 90 100
1 2 5 3 11 25 50 90 120
2 3 7 9 14 35 55 100 120
What I want is to generate new columns like this:
- New column
a1_2
is calculated by the mean of columns a1
and a2
row-wise.
- New column
a1_3
is calculated by the mean of columns a1
, a2
and a3
row-wise.
- New column
a1_4
is calculated by the mean of columns a1
, a2
, a3
and a4
row-wise.
The same should happen for the grouped columns with b
. Of course you can do this manually, but this is not ideal when you have too many variables. Here is the expected output:
df['a1_2'] = df[['a1', 'a2']].mean(axis=1)
df['a1_3'] = df[['a1', 'a2', 'a3']].mean(axis=1)
df['a1_4'] = df[['a1', 'a2', 'a3', 'a4']].mean(axis=1)
df['b1_2'] = df[['b1', 'b2']].mean(axis=1)
df['b1_3'] = df[['b1', 'b2', 'b3']].mean(axis=1)
df['b1_4'] = df[['b1', 'b2', 'b3', 'b4']].mean(axis=1)
a1 a2 a3 a4 b1 b2 b3 b4 a1_2 a1_3 a1_4 b1_2 b1_3 b1_4
0 1 4 6 10 15 40 90 100 2.5 3.666667 5.25 27.5 48.333333 61.25
1 2 5 3 11 25 50 90 120 3.5 3.333333 5.25 37.5 55.000000 71.25
2 3 7 9 14 35 55 100 120 5.0 6.333333 8.25 45.0 63.333333 77.50
So I was wondering if there is some automatic way of doing this?
Answers:
IIUC, you can reshape with wide_to_long
, perform a groupby.rolling
, then unstack
and flatten the multiindex:
df.join(pd
.wide_to_long(df.reset_index(), stubnames=['a', 'b'], i='index', j='n')
.groupby(level='index').expanding().mean()
.loc[lambda d: d.index.get_level_values('n')>1]
.droplevel(0).unstack('n')
.pipe(lambda d: d.set_axis(d.columns.map(lambda x: f'{x[0]}1_{x[1]}'), axis=1))
)
output:
a1 a2 a3 a4 b1 b2 b3 b4 a1_2 a1_3 a1_4 b1_2 b1_3 b1_4
0 1 4 6 10 15 40 90 100 2.5 3.666667 5.25 27.5 48.333333 61.25
1 2 5 3 11 25 50 90 120 3.5 3.333333 5.25 37.5 55.000000 71.25
2 3 7 9 14 35 55 100 120 5.0 6.333333 8.25 45.0 63.333333 77.50
expanding.mean
for c in ('a', 'b'):
m = df.filter(like=c).expanding(axis=1).mean().iloc[:, 1:]
df[m.columns.str.replace(r'(d+)$', r'1_1', regex=True)] = m
Result
a1 a2 a3 a4 b1 b2 b3 b4 a1_2 a1_3 a1_4 b1_2 b1_3 b1_4
0 1 4 6 10 15 40 90 100 2.5 3.666667 5.25 27.5 48.333333 61.25
1 2 5 3 11 25 50 90 120 3.5 3.333333 5.25 37.5 55.000000 71.25
2 3 7 9 14 35 55 100 120 5.0 6.333333 8.25 45.0 63.333333 77.50
Another option:
out = [value.expanding(axis=1).mean()
.rename(columns = lambda col: f"{col[0]}1_{col[1]}")
for _, value in df.groupby(df.columns.str[0], axis = 1)]
pd.concat([df]+out, axis = 1)
a1 a2 a3 a4 b1 b2 b3 b4 a1_1 a1_2 a1_3 a1_4 b1_1 b1_2 b1_3 b1_4
0 1 4 6 10 15 40 90 100 1.0 2.5 3.666667 5.25 15.0 27.5 48.333333 61.25
1 2 5 3 11 25 50 90 120 2.0 3.5 3.333333 5.25 25.0 37.5 55.000000 71.25
2 3 7 9 14 35 55 100 120 3.0 5.0 6.333333 8.25 35.0 45.0 63.333333 77.50
groups = df.groupby(lambda col: col[0], axis=1)
cummeans = (groups.cumsum(axis=1).div(groups.cumcount().add(1))
.filter(regex="[^1]$")
.rename(lambda col: re.sub(r"(d+)$", r"1_1", col), axis=1))
result = df.join(cummeans)
-
get the groups determined by 1st character of columns (or some other way depending on pattern)
-
get the cumulative means = cumsum / cumcount+1
-
filter out the very first cummeans, e.g., to-be a1_1 etc.
-
insert "1_" into the cummean column names
-
join with the original df
In [19]: groups = df.groupby(lambda col: col[0], axis=1)
In [20]: cummeans = groups.cumsum(axis=1).div(groups.cumcount().add(1))
In [21]: cummeans
Out[21]:
a1 a2 a3 a4 b1 b2 b3 b4
0 1.0 2.5 3.666667 5.25 15.0 27.5 48.333333 61.25
1 2.0 3.5 3.333333 5.25 25.0 37.5 55.000000 71.25
2 3.0 5.0 6.333333 8.25 35.0 45.0 63.333333 77.50
In [22]: _.filter(regex="[^1]$")
Out[22]:
a2 a3 a4 b2 b3 b4
0 2.5 3.666667 5.25 27.5 48.333333 61.25
1 3.5 3.333333 5.25 37.5 55.000000 71.25
2 5.0 6.333333 8.25 45.0 63.333333 77.50
In [23]: _.rename(lambda col: re.sub(r"(d+)$", r"1_1", col), axis=1)
Out[23]:
a1_2 a1_3 a1_4 b1_2 b1_3 b1_4
0 2.5 3.666667 5.25 27.5 48.333333 61.25
1 3.5 3.333333 5.25 37.5 55.000000 71.25
2 5.0 6.333333 8.25 45.0 63.333333 77.50
In [24]: df.join(_)
Out[24]:
a1 a2 a3 a4 b1 b2 b3 b4 a1_2 a1_3 a1_4 b1_2 b1_3 b1_4
0 1 4 6 10 15 40 90 100 2.5 3.666667 5.25 27.5 48.333333 61.25
1 2 5 3 11 25 50 90 120 3.5 3.333333 5.25 37.5 55.000000 71.25
2 3 7 9 14 35 55 100 120 5.0 6.333333 8.25 45.0 63.333333 77.50
as a "one" line:
df.join(df.groupby(lambda col: col[0], axis=1)
.pipe(lambda gr: gr.cumsum(axis=1).div(gr.cumcount().add(1))
.filter(regex="[^1]$")
.rename(lambda col: re.sub(r"(d+)$", r"1_1", col), axis=1)))
def function1(ss:pd.Series):
return ss.to_frame('col1').assign(col2=lambda dd:dd.index.str[1:].astype(int)).assign(col3=lambda dd:dd.index.str[0])
.groupby('col3').apply(lambda dd:dd.col1.cumsum()/dd.col2.iloc[1:]).dropna().droplevel(0)
df1.merge(df1.apply(function1,axis=1),left_index=True,right_index=True,suffixes=['','_1']).pipe(print)
a1 a2 a3 a4 b1 b2 b3 b4 a2_1 a3_1 a4_1 b2_1 b3_1
0 1 4 6 10 15 40 90 100 2.5 3.666667 5.25 27.5 48.333333
1 2 5 3 11 25 50 90 120 3.5 3.333333 5.25 37.5 55.000000
2 3 7 9 14 35 55 100 120 5.0 6.333333 8.25 45.0 63.333333
b4_1
0 61.25
1 71.25
2 77.50
I would like to create multiple columns which show the row-wise cumulative mean for grouped columns. Here is some sample data:
import pandas as pd
data = [[1, 4, 6, 10, 15, 40, 90, 100], [2, 5, 3, 11, 25, 50, 90, 120], [3, 7, 9, 14, 35, 55, 100, 120]]
df = pd.DataFrame(data, columns=['a1', 'a2', 'a3', 'a4', 'b1', 'b2', 'b3', 'b4'])
a1 a2 a3 a4 b1 b2 b3 b4
0 1 4 6 10 15 40 90 100
1 2 5 3 11 25 50 90 120
2 3 7 9 14 35 55 100 120
What I want is to generate new columns like this:
- New column
a1_2
is calculated by the mean of columnsa1
anda2
row-wise. - New column
a1_3
is calculated by the mean of columnsa1
,a2
anda3
row-wise. - New column
a1_4
is calculated by the mean of columnsa1
,a2
,a3
anda4
row-wise.
The same should happen for the grouped columns with b
. Of course you can do this manually, but this is not ideal when you have too many variables. Here is the expected output:
df['a1_2'] = df[['a1', 'a2']].mean(axis=1)
df['a1_3'] = df[['a1', 'a2', 'a3']].mean(axis=1)
df['a1_4'] = df[['a1', 'a2', 'a3', 'a4']].mean(axis=1)
df['b1_2'] = df[['b1', 'b2']].mean(axis=1)
df['b1_3'] = df[['b1', 'b2', 'b3']].mean(axis=1)
df['b1_4'] = df[['b1', 'b2', 'b3', 'b4']].mean(axis=1)
a1 a2 a3 a4 b1 b2 b3 b4 a1_2 a1_3 a1_4 b1_2 b1_3 b1_4
0 1 4 6 10 15 40 90 100 2.5 3.666667 5.25 27.5 48.333333 61.25
1 2 5 3 11 25 50 90 120 3.5 3.333333 5.25 37.5 55.000000 71.25
2 3 7 9 14 35 55 100 120 5.0 6.333333 8.25 45.0 63.333333 77.50
So I was wondering if there is some automatic way of doing this?
IIUC, you can reshape with wide_to_long
, perform a groupby.rolling
, then unstack
and flatten the multiindex:
df.join(pd
.wide_to_long(df.reset_index(), stubnames=['a', 'b'], i='index', j='n')
.groupby(level='index').expanding().mean()
.loc[lambda d: d.index.get_level_values('n')>1]
.droplevel(0).unstack('n')
.pipe(lambda d: d.set_axis(d.columns.map(lambda x: f'{x[0]}1_{x[1]}'), axis=1))
)
output:
a1 a2 a3 a4 b1 b2 b3 b4 a1_2 a1_3 a1_4 b1_2 b1_3 b1_4
0 1 4 6 10 15 40 90 100 2.5 3.666667 5.25 27.5 48.333333 61.25
1 2 5 3 11 25 50 90 120 3.5 3.333333 5.25 37.5 55.000000 71.25
2 3 7 9 14 35 55 100 120 5.0 6.333333 8.25 45.0 63.333333 77.50
expanding.mean
for c in ('a', 'b'):
m = df.filter(like=c).expanding(axis=1).mean().iloc[:, 1:]
df[m.columns.str.replace(r'(d+)$', r'1_1', regex=True)] = m
Result
a1 a2 a3 a4 b1 b2 b3 b4 a1_2 a1_3 a1_4 b1_2 b1_3 b1_4
0 1 4 6 10 15 40 90 100 2.5 3.666667 5.25 27.5 48.333333 61.25
1 2 5 3 11 25 50 90 120 3.5 3.333333 5.25 37.5 55.000000 71.25
2 3 7 9 14 35 55 100 120 5.0 6.333333 8.25 45.0 63.333333 77.50
Another option:
out = [value.expanding(axis=1).mean()
.rename(columns = lambda col: f"{col[0]}1_{col[1]}")
for _, value in df.groupby(df.columns.str[0], axis = 1)]
pd.concat([df]+out, axis = 1)
a1 a2 a3 a4 b1 b2 b3 b4 a1_1 a1_2 a1_3 a1_4 b1_1 b1_2 b1_3 b1_4
0 1 4 6 10 15 40 90 100 1.0 2.5 3.666667 5.25 15.0 27.5 48.333333 61.25
1 2 5 3 11 25 50 90 120 2.0 3.5 3.333333 5.25 25.0 37.5 55.000000 71.25
2 3 7 9 14 35 55 100 120 3.0 5.0 6.333333 8.25 35.0 45.0 63.333333 77.50
groups = df.groupby(lambda col: col[0], axis=1)
cummeans = (groups.cumsum(axis=1).div(groups.cumcount().add(1))
.filter(regex="[^1]$")
.rename(lambda col: re.sub(r"(d+)$", r"1_1", col), axis=1))
result = df.join(cummeans)
-
get the groups determined by 1st character of columns (or some other way depending on pattern)
-
get the cumulative means = cumsum / cumcount+1
-
filter out the very first cummeans, e.g., to-be a1_1 etc.
-
insert "1_" into the cummean column names
-
join with the original df
In [19]: groups = df.groupby(lambda col: col[0], axis=1)
In [20]: cummeans = groups.cumsum(axis=1).div(groups.cumcount().add(1))
In [21]: cummeans
Out[21]:
a1 a2 a3 a4 b1 b2 b3 b4
0 1.0 2.5 3.666667 5.25 15.0 27.5 48.333333 61.25
1 2.0 3.5 3.333333 5.25 25.0 37.5 55.000000 71.25
2 3.0 5.0 6.333333 8.25 35.0 45.0 63.333333 77.50
In [22]: _.filter(regex="[^1]$")
Out[22]:
a2 a3 a4 b2 b3 b4
0 2.5 3.666667 5.25 27.5 48.333333 61.25
1 3.5 3.333333 5.25 37.5 55.000000 71.25
2 5.0 6.333333 8.25 45.0 63.333333 77.50
In [23]: _.rename(lambda col: re.sub(r"(d+)$", r"1_1", col), axis=1)
Out[23]:
a1_2 a1_3 a1_4 b1_2 b1_3 b1_4
0 2.5 3.666667 5.25 27.5 48.333333 61.25
1 3.5 3.333333 5.25 37.5 55.000000 71.25
2 5.0 6.333333 8.25 45.0 63.333333 77.50
In [24]: df.join(_)
Out[24]:
a1 a2 a3 a4 b1 b2 b3 b4 a1_2 a1_3 a1_4 b1_2 b1_3 b1_4
0 1 4 6 10 15 40 90 100 2.5 3.666667 5.25 27.5 48.333333 61.25
1 2 5 3 11 25 50 90 120 3.5 3.333333 5.25 37.5 55.000000 71.25
2 3 7 9 14 35 55 100 120 5.0 6.333333 8.25 45.0 63.333333 77.50
as a "one" line:
df.join(df.groupby(lambda col: col[0], axis=1)
.pipe(lambda gr: gr.cumsum(axis=1).div(gr.cumcount().add(1))
.filter(regex="[^1]$")
.rename(lambda col: re.sub(r"(d+)$", r"1_1", col), axis=1)))
def function1(ss:pd.Series):
return ss.to_frame('col1').assign(col2=lambda dd:dd.index.str[1:].astype(int)).assign(col3=lambda dd:dd.index.str[0])
.groupby('col3').apply(lambda dd:dd.col1.cumsum()/dd.col2.iloc[1:]).dropna().droplevel(0)
df1.merge(df1.apply(function1,axis=1),left_index=True,right_index=True,suffixes=['','_1']).pipe(print)
a1 a2 a3 a4 b1 b2 b3 b4 a2_1 a3_1 a4_1 b2_1 b3_1
0 1 4 6 10 15 40 90 100 2.5 3.666667 5.25 27.5 48.333333
1 2 5 3 11 25 50 90 120 3.5 3.333333 5.25 37.5 55.000000
2 3 7 9 14 35 55 100 120 5.0 6.333333 8.25 45.0 63.333333
b4_1
0 61.25
1 71.25
2 77.50