Take the sum of every N rows per group in a pandas DataFrame

Question:

Just want to mention that this question is not a duplicate of this: Take the sum of every N rows in pandas series

My problem is a bit different as I want to calculate N rows per group. My current code looks like this:

import pandas as pd

df = pd.DataFrame({'ID':['AA','AA','AA','BB','BB','BB'], 
                   'DATE':['2021-01-01','2021-01-03','2021-01-08','2021-03-04','2021-03-06','2021-03-08'],
                   'VALUE':[10,15,25,40,60,90]})

df['DATE'] = pd.to_datetime(df['DATE'])
df = df.sort_values(by=['ID','DATE'])
df.head(10)

Sample DataFrame:

+----+------------+-------+
| ID | DATE       | VALUE |
+----+------------+-------+
| AA | 2021-01-01 |   10  |
+----+------------+-------+
| AA | 2021-01-03 |   15  |
+----+------------+-------+
| AA | 2021-01-08 |   25  |
+----+------------+-------+
| BB | 2021-03-04 |   40  |
+----+------------+-------+
| BB | 2021-03-06 |   60  |
+----+------------+-------+
| BB | 2021-03-08 |   90  |
+----+------------+-------+

I apply this preprocessing based on the post:

#Calculate result
df.groupby(['ID', df.index//2]).agg({'VALUE':'mean', 'DATE':'median'}).reset_index()

I get this:

+----+-------+------------+
| ID | VALUE | DATE       |
+----+-------+------------+
| AA | 12.5  | 2021-01-02 |
+----+-------+------------+
| AA | 25    | 2021-01-08 |
+----+-------+------------+
| BB | 40    | 2021-03-04 |
+----+-------+------------+
| BB | 75    | 2021-03-07 |
+----+-------+------------+

But I want this:

+----+-------+------------+
| ID | VALUE | DATE       |
+----+-------+------------+
| AA | 12.5  | 2021-01-02 |
+----+-------+------------+
| AA | 25    | 2021-01-08 |
+----+-------+------------+
| BB | 50    | 2021-03-05 |
+----+-------+------------+
| BB | 90    | 2021-03-08 |
+----+-------+------------+

It seems like pandas index does not work well when my groups are not perfectly aligned and it messes up the beginning of the next series and how aggregations happen. Any suggestions? My dates can be completely irregular by the way.

Asked By: Nick

||

Answers:

You can use groupby.cumcount to form a subgroup:

N = 2
group = df.groupby('ID').cumcount()//N

out = (df.groupby(['ID', group])
         .agg({'VALUE':'mean', 'DATE':'median'})
         .droplevel(1).reset_index()
      )

output:

   ID  VALUE       DATE
0  AA   12.5 2021-01-02
1  AA   25.0 2021-01-08
2  BB   50.0 2021-03-05
3  BB   90.0 2021-03-08
Answered By: mozway
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.