Pandas Groupby only same ID and when column value is false

Question:

I have the following problem I am currently trying to solve. I have a dataframe with many numerical columns, a "serial_number" column that acts as an ID, and a column "update" that is either True or False. For each numerical column, I need to aggregate (sum or mean) all consecutive rows where "update" is False together with the next row where "update" is True (including the row where "update" = True).

To give you some additional context: the entries are used for training a machine learning model, but for the rows where "update" = False I don't have the target variable. Therefore, I need to sum or average those values into the next "update" = True row.

Thanks in advance!

As an example, this would be the input table:

serial_number model numerical_mean numerical_1_sum numerical_2_sum update
a 2023-01-01 5 10 20 False
a 2023-01-02 10 15 10 False
a 2023-01-03 15 15 10 True
b 2023-01-01 10 15 10 False
b 2023-01-02 15 15 10 True
b 2023-01-03 15 15 10 False
b 2023-01-04 15 15 10 True
b 2023-01-05 15 15 10 False
c 2023-01-04 15 15 10 True

The resulting output should look like this:

serial_number model numerical_mean numerical_1_sum numerical_2_sum update
a 2023-01-03 10 40 40 True
b 2023-01-02 12.5 30 20 True
b 2023-01-04 15 30 20 True
c 2023-01-04 15 15 10 True

The output table has the same number of rows as the input table has rows where "update" = True. So basically, within each serial_number, I take all rows from the first "update" = False row up to and including the next "update" = True row, and either sum them up or take the average.
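For reference, the input table above can be rebuilt as a DataFrame like this (a sketch; the date-like column is named "model", following the input header):

```python
import pandas as pd

# Reconstruction of the example input table from the question
df = pd.DataFrame({
    'serial_number': list('aaabbbbbc'),
    'model': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01',
              '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05',
              '2023-01-04'],
    'numerical_mean': [5, 10, 15, 10, 15, 15, 15, 15, 15],
    'numerical_1_sum': [10, 15, 15, 15, 15, 15, 15, 15, 15],
    'numerical_2_sum': [20, 10, 10, 10, 10, 10, 10, 10, 10],
    'update': [False, False, True, False, True, False, True, False, True],
})
```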

Asked By: bktllr


Answers:

Code

# filter the columns that you would like to aggregate
c1 = df.filter(like='_sum')
c2 = df.filter(like='_mean')

# create an agg dictionary that maps each column name
# to the corresponding aggregation function
agg_dict = {
    'model': 'last', 
    'update': 'any',
    **dict.fromkeys(c1, 'sum'), 
    **dict.fromkeys(c2, 'mean'),
}

# reversed cumsum as a grouper: labels each block of rows that
# ends with an update=True row
b = df[::-1]['update'].cumsum()

# group the dataframe by serial_number and blocks and aggregate
result = df.groupby(['serial_number', b]).agg(agg_dict)

# Query the results to remove the rows that do 
# not have any subsequent rows with 'update=true',
# for example, (b, 2023-01-05).
result = result.droplevel(1).query('update').reset_index()

Result

  serial_number       model  update  numerical_1_sum  numerical_2_sum  numerical_mean
0             a  2023-01-03    True               40               40            10.0
1             b  2023-01-04    True               30               20            15.0
2             b  2023-01-02    True               30               20            12.5
3             c  2023-01-04    True               15               10            15.0
Answered By: Shubham Sharma
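For completeness, here is a self-contained sketch of the above approach run end-to-end on the question's data (the frame construction is reconstructed from the input table). The key trick is the reversed cumulative sum: scanning the frame bottom-up, every True row opens a new label, so each run of False rows and the True row that follows it share one block id.

```python
import pandas as pd

# Reconstruction of the question's input table
df = pd.DataFrame({
    'serial_number': list('aaabbbbbc'),
    'model': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01',
              '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05',
              '2023-01-04'],
    'numerical_mean': [5, 10, 15, 10, 15, 15, 15, 15, 15],
    'numerical_1_sum': [10, 15, 15, 15, 15, 15, 15, 15, 15],
    'numerical_2_sum': [20, 10, 10, 10, 10, 10, 10, 10, 10],
    'update': [False, False, True, False, True, False, True, False, True],
})

# Reversed cumsum: each True row (seen bottom-up) starts a new block,
# so every False-run plus its following True row shares one label.
b = df[::-1]['update'].cumsum()
print(b.sort_index().tolist())  # [4, 4, 4, 3, 3, 2, 2, 1, 1]

# Map '_sum' columns to 'sum' and '_mean' columns to 'mean'
agg_dict = {'model': 'last', 'update': 'any',
            **dict.fromkeys(df.filter(like='_sum'), 'sum'),
            **dict.fromkeys(df.filter(like='_mean'), 'mean')}

# Aggregate per (serial_number, block) and drop trailing False-only blocks
result = (df.groupby(['serial_number', b]).agg(agg_dict)
            .droplevel(1).query('update').reset_index())
print(result)
```

Note that the trailing block (b, 2023-01-05) gets its own label (1) with no True row, so `query('update')` drops it after the `'any'` aggregation.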