pandas group by one column, aggregate another column, filter on a different column

Question:

Say this is my data.

df = pd.DataFrame({'num_legs': [4,4,5,6,7,4,2,3,4, 2,4,4,5,6,7,4,2,3,3,5,5,6], 'num_wings': [2,7,21,0,21,13,23,43, 2,7,21,13,23,43,23,23,23,11,26,32,75,13], 'new_col':np.arange(22)})

I would like to do the following.

  1. Group by ‘num_legs’ and compute rolling(3, min_periods=1) on ‘new_col’ (a rolling mean over the past 3 values, with at least one value required in the window).
  2. However, when computing rolling(3), I don’t want to use all values of new_col in the group, only the values of new_col in rows where num_wings > 10.
  3. I need a transform for the value from #2, i.e., I would like to populate the result of #2 for all rows in the df.
  4. EDIT: Also, rows that have num_wings > 10 should get a rolling mean (= previous or next row value, by group).

How can I do this? Something like below is what I am thinking, but it’s incorrect.

df.groupby('num_legs')['new_col'].transform(lambda x: df.loc[df['num_wings']>10, 'new_col'].rolling(3))
Asked By: tjt


Answers:

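# Keep only the rows that should contribute to the rolling mean
df_filtered = df[df['num_wings'] > 10]

# Rolling mean per num_legs group; dropping the group level of the index
# lets the result align with df's original row labels on assignment
df['rolling_mean'] = df_filtered.groupby('num_legs')['new_col'].rolling(
    3).mean().reset_index(level=0, drop=True)

# Within each group, carry the last computed rolling mean forward to later rows
df['rolling_mean'] = df.groupby('num_legs')['rolling_mean'].ffill()
# Rows that still have no value (before a group's first full window) become 0
df['rolling_mean'] = df['rolling_mean'].fillna(0)

# Print the resulting DataFrame
print(df)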

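df_filtered is first created by filtering df for rows where the value in the num_wings column is greater than 10. Then, df_filtered is grouped by the num_legs column and a rolling mean is calculated over the new_col column with a window size of 3. Because min_periods is not set, it defaults to the window size, so a group produces a value only once it has three qualifying rows. reset_index(level=0, drop=True) removes the num_legs level from the result's index so that only the original row labels remain.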

Since df_filtered only contains the rows where num_wings is greater than 10, the resulting rolling mean series only has entries for those rows, and the first two qualifying rows of each group are NaN because the window is not yet full. When the series is assigned to the original DataFrame as a new column (rolling_mean), pandas aligns it on the index, so every row that is missing from df_filtered gets NaN.
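The index alignment described above can be seen on a smaller frame. This is a minimal sketch with made-up data (the six-row frame and the names filtered and rolled are illustrative assumptions, not part of the original answer):

```python
import numpy as np
import pandas as pd

# Small made-up frame, only for illustration
df = pd.DataFrame({
    'num_legs': [4, 4, 5, 4, 5, 4],
    'num_wings': [2, 13, 21, 11, 7, 23],
    'new_col': np.arange(6),
})

filtered = df[df['num_wings'] > 10]           # rows 1, 2, 3, 5
rolled = filtered.groupby('num_legs')['new_col'].rolling(3).mean()
levels_before = rolled.index.nlevels          # 2: (num_legs, original row label)

# Drop the num_legs level so only the original row labels remain
rolled = rolled.reset_index(level=0, drop=True)

# Assignment aligns on the index; rows absent from `rolled` become NaN
df['rolling_mean'] = rolled
print(df)
```

Only row 5 receives a value here, because it is the third qualifying row of the num_legs == 4 group; all other rows are NaN until they are filled in a later step.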

To carry computed values forward, the groupby method is used again on the num_legs column, and the ffill method is applied to each group separately. This ensures that a missing value is only filled with the previous value within the same group, and never with a value from a group with a different num_legs. Forward fill cannot fill rows that come before a group's first computed value; those rows stay NaN and are set to 0 by the final fillna(0).
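To make the behaviour of the group-wise forward fill concrete, here is a small sketch on made-up data (the frame toy and its columns key and val are hypothetical names used only for this illustration):

```python
import numpy as np
import pandas as pd

# Made-up data: group 'a' is rows 0, 1, 2, 5 and group 'b' is rows 3, 4
toy = pd.DataFrame({
    'key': ['a', 'a', 'a', 'b', 'b', 'a'],
    'val': [np.nan, 1.0, np.nan, np.nan, 2.0, np.nan],
})

# Forward fill inside each group; the result keeps the original index
filled = toy.groupby('key')['val'].ffill()

# Leading NaNs of a group have nothing before them, so they survive ffill
result = filled.fillna(0)
print(result.tolist())
```

Row 3 is not filled from row 1 even though it follows it in the frame, because the two rows belong to different groups.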

Output:

    num_legs  num_wings  new_col  rolling_mean
0          4          2        0      0.000000
1          4          7        1      0.000000
2          5         21        2      0.000000
3          6          0        3      0.000000
4          7         21        4      0.000000
5          4         13        5      0.000000
6          2         23        6      0.000000
7          3         43        7      0.000000
8          4          2        8      0.000000
9          2          7        9      0.000000
10         4         21       10      0.000000
11         4         13       11      8.666667
12         5         23       12      0.000000
13         6         43       13      0.000000
14         7         23       14      0.000000
15         4         23       15     12.000000
16         2         23       16      0.000000
17         3         11       17      0.000000
18         3         26       18     14.000000
19         5         32       19     11.000000
20         5         75       20     17.000000
21         6         13       21      0.000000
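The question asks for min_periods=1, which the code above does not set. The following is a sketch of one possible variant, under the assumption that rows without their own value should take the previous qualifying value in their group, or the next one when there is no previous value. That is only one reading of points 3 and 4 of the question, and the names mask, rolled, ff and bf are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'num_legs': [4, 4, 5, 6, 7, 4, 2, 3, 4, 2, 4, 4, 5, 6, 7, 4, 2, 3, 3, 5, 5, 6],
    'num_wings': [2, 7, 21, 0, 21, 13, 23, 43, 2, 7, 21, 13, 23, 43, 23, 23, 23,
                  11, 26, 32, 75, 13],
    'new_col': np.arange(22),
})

mask = df['num_wings'] > 10

# Rolling mean over qualifying rows only; min_periods=1 gives a value
# as soon as a group has one qualifying row
rolled = (df[mask].groupby('num_legs')['new_col']
          .rolling(3, min_periods=1).mean()
          .reset_index(level=0, drop=True))
df['rolling_mean'] = rolled  # NaN for rows with num_wings <= 10

# Assumption: fill from the previous value in the group, else from the next one
ff = df.groupby('num_legs')['rolling_mean'].ffill()
bf = df.groupby('num_legs')['rolling_mean'].bfill()
df['rolling_mean'] = ff.fillna(bf)
print(df)
```

With this data every group has at least one row with num_wings > 10, so no NaN remains; a group without any qualifying row would stay NaN and would need its own default.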
Answered By: C_Turbo
import pandas as pd
import numpy as np

df = pd.DataFrame({'num_legs': [4,4,5,6,7,4,2,3,4, 2,4,4,5,6,7,4,2,3,3,5,5,6], 
              'num_wings': [2,7,21,0,21,13,23,43, 2,7,21,13,23,43,23,23,23,11,26,32,75,13], 
              'new_col':np.arange(22)})

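m = df['num_wings'].gt(10)  # True where num_wings > 10

# Roll over every 3 consecutive rows of the frame (no grouping) and average
# new_col over the rows in each window that satisfy the mask
df['mean'] = (df['new_col'].rolling(3)
                .apply(lambda w: w[m.loc[w.index]].mean())
                .fillna(0))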

print(df)
    num_legs  num_wings  new_col  mean
0          4          2        0   0.0
1          4          7        1   0.0
2          5         21        2   2.0
3          6          0        3   2.0
4          7         21        4   3.0
5          4         13        5   4.5
6          2         23        6   5.0
7          3         43        7   6.0
8          4          2        8   6.5
9          2          7        9   7.0
10         4         21       10  10.0
11         4         13       11  10.5
12         5         23       12  11.0
13         6         43       13  12.0
14         7         23       14  13.0
15         4         23       15  14.0
16         2         23       16  15.0
17         3         11       17  16.0
18         3         26       18  17.0
19         5         32       19  18.0
20         5         75       20  19.0
21         6         13       21  20.0
Answered By: Laurent B.