Pandas Groupby only same ID and when column value is false

Question:

I have the following problem I am currently trying to solve. I have a dataframe with many numerical columns, a "serial_number" column that acts as an ID, and a column "update" that is either True or False. For each numerical column, I need to aggregate (sum or mean) all consecutive rows where "update" is False together with the next row where "update" is True (including the row where "update" = True).

To give you some additional context: the entries are used for training a machine learning model, but for the rows where "update" = False I don't have the target variable. Therefore, I need to sum or average those values into the next "update" = True row.

Thanks in advance!

As an example, this would be the input table:

serial_number model numerical_mean numerical_1_sum numerical_2_sum update
a 2023-01-01 5 10 20 False
a 2023-01-02 10 15 10 False
a 2023-01-03 15 15 10 True
b 2023-01-01 10 15 10 False
b 2023-01-02 15 15 10 True
b 2023-01-03 15 15 10 False
b 2023-01-04 15 15 10 True
b 2023-01-05 15 15 10 False
c 2023-01-04 15 15 10 True

The resulting output should look like this:

serial_number model numerical_mean numerical_1_sum numerical_2_sum update
a 2023-01-03 10 40 40 True
b 2023-01-02 12.5 30 20 True
b 2023-01-04 15 30 20 True
c 2023-01-04 15 15 10 True

The output table has the same number of rows as the input table has rows where "update" = True. So basically, within each serial_number, I take all rows from the first "update" = False row up to and including the next "update" = True row, and either sum them up or take the average.
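For reference, the input table above can be rebuilt as a DataFrame like this (a sketch; the date-like column is named "model", following the input header):

```python
import pandas as pd

# Reconstruction of the example input table from the question
df = pd.DataFrame({
    'serial_number': list('aaabbbbbc'),
    'model': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01',
              '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05',
              '2023-01-04'],
    'numerical_mean': [5, 10, 15, 10, 15, 15, 15, 15, 15],
    'numerical_1_sum': [10, 15, 15, 15, 15, 15, 15, 15, 15],
    'numerical_2_sum': [20, 10, 10, 10, 10, 10, 10, 10, 10],
    'update': [False, False, True, False, True, False, True, False, True],
})
```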

Asked By: bktllr


Answers:

Code

# filter the columns that you would like to aggregate
c1 = df.filter(like='_sum')
c2 = df.filter(like='_mean')

# create an agg dictionary that maps each column name
# to the corresponding aggregation function
agg_dict = {
    'model': 'last', 
    'update': 'any',
    **dict.fromkeys(c1, 'sum'), 
    **dict.fromkeys(c2, 'mean'),
}

# reversed cumsum as a grouper: labels each block of rows that
# ends with an update=True row
b = df[::-1]['update'].cumsum()

# group the dataframe by serial_number and blocks and aggregate
result = df.groupby(['serial_number', b]).agg(agg_dict)

# Query the results to remove the rows that do 
# not have any subsequent rows with 'update=true',
# for example, (b, 2023-01-05).
result = result.droplevel(1).query('update').reset_index()

Result

  serial_number       model  update  numerical_1_sum  numerical_2_sum  numerical_mean
0             a  2023-01-03    True               40               40            10.0
1             b  2023-01-04    True               30               20            15.0
2             b  2023-01-02    True               30               20            12.5
3             c  2023-01-04    True               15               10            15.0
Answered By: Shubham Sharma
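For completeness, here is a self-contained sketch of the above approach run end-to-end on the question's data (the frame construction is reconstructed from the input table). The key trick is the reversed cumulative sum: scanning the frame bottom-up, every True row opens a new label, so each run of False rows and the True row that follows it share one block id.

```python
import pandas as pd

# Reconstruction of the question's input table
df = pd.DataFrame({
    'serial_number': list('aaabbbbbc'),
    'model': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01',
              '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05',
              '2023-01-04'],
    'numerical_mean': [5, 10, 15, 10, 15, 15, 15, 15, 15],
    'numerical_1_sum': [10, 15, 15, 15, 15, 15, 15, 15, 15],
    'numerical_2_sum': [20, 10, 10, 10, 10, 10, 10, 10, 10],
    'update': [False, False, True, False, True, False, True, False, True],
})

# Reversed cumsum: each True row (seen bottom-up) starts a new block,
# so every False-run plus its following True row shares one label.
b = df[::-1]['update'].cumsum()
print(b.sort_index().tolist())  # [4, 4, 4, 3, 3, 2, 2, 1, 1]

# Map '_sum' columns to 'sum' and '_mean' columns to 'mean'
agg_dict = {'model': 'last', 'update': 'any',
            **dict.fromkeys(df.filter(like='_sum'), 'sum'),
            **dict.fromkeys(df.filter(like='_mean'), 'mean')}

# Aggregate per (serial_number, block) and drop trailing False-only blocks
result = (df.groupby(['serial_number', b]).agg(agg_dict)
            .droplevel(1).query('update').reset_index())
print(result)
```

Note that the trailing block (b, 2023-01-05) gets its own label (1) with no True row, so `query('update')` drops it after the `'any'` aggregation.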