Iterate through chunks of a pandas DataFrame

Question:

I have a pandas.DataFrame that looks like the following:

Week Monday Tuesday Wednesday Thursday Friday
City A 100 300 x z w
City B 200 400 y q p
None None None None None None
Week Monday Tuesday Wednesday Thursday Friday
City A 150 320 a c e
City B 210 470 z t q
City C 260 446 b d f
None None None None None None

This repeats until all weeks in a year are covered (it’s basically a weekly calendar with data in it).

I wish to loop through the DataFrame in chunks, and do some operations with the data within those chunks.

The chunks should be basically "Week-to-Week"-high and "Week-to-Friday"-wide, if that makes sense. However, as you can see, the chunks are not equally large, so I can’t hard-code the size to be 4×6, for example. They do, however, always run from one "Week" row to the next and extend to the right as far as "Friday".

Is there any intuitive way I can iterate through my DataFrame? Any help is appreciated.

Asked By: Lucas B. Bahadir


Answers:

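You can label each row with a running count of the all-NaN separator rows and group on that label:

df['week_index'] = df.isna().all(axis='columns').astype(int).cumsum()
for _, df_chunk in df.groupby('week_index'):
    print(df_chunk)  # do something with each chunk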

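In the version above, each separator row opens the next group. To make each chunk run from a "Week" row down to its own separator row, shift the marker by one row first:

# Apply this to the original df: once 'week_index' exists, no row is all-NaN.
df['week_index'] = df.isna().all(axis='columns').astype(int).shift(1, fill_value=0).cumsum()
for _, df_chunk in df.groupby('week_index'):
    print(df_chunk)  # process each chunk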
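
As a self-contained sketch of the shifted variant, the snippet below rebuilds a small stand-in for the question's frame and groups on an external Series, so no helper column is added and the separator rows can be dropped first. The column labels c0 to c5 and the names is_separator, week_index and chunks are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Stand-in for the question's frame: header rows repeat inside the data
# and an all-NaN row closes each week.
cols = ["c0", "c1", "c2", "c3", "c4", "c5"]  # hypothetical column labels
header = ["Week", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
rows = [
    header,
    ["City A", 100, 300, "x", "z", "w"],
    ["City B", 200, 400, "y", "q", "p"],
    [np.nan] * 6,
    header,
    ["City A", 150, 320, "a", "c", "e"],
    ["City B", 210, 470, "z", "t", "q"],
    ["City C", 260, 446, "b", "d", "f"],
    [np.nan] * 6,
]
df = pd.DataFrame(rows, columns=cols)

# True on separator rows (every cell missing).
is_separator = df.isna().all(axis="columns")

# Shifting by one row keeps each separator with the week it closes.
week_index = is_separator.astype(int).shift(1, fill_value=0).cumsum()

# Drop the separator rows, then group on the matching slice of the label Series.
keep = ~is_separator
chunks = [chunk for _, chunk in df[keep].groupby(week_index[keep])]

for chunk in chunks:
    print(chunk, "\n")
```

Each element of chunks then starts with its "Week" header row followed by that week's city rows, whatever the number of cities.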
Answered By: Learning is a mess

Reproducing your data with the CSV file,

# data.csv
Week,Monday,Tuesday,Wednesday,Thursday,Friday
City A,100,300,x,z,w
City B,200,400,y,q,p
None,None,None,None,None,None
Week,Monday,Tuesday,Wednesday,Thursday,Friday
City A,150,320,a,c,e
City B,210,470,z,t,q
City C,260,446,b,d,f
None,None,None,None,None,None

you can do the following in order to clean the original dataset and turn it into something more useful for groups and aggregations:

import pandas as pd

# Reproduce your data, then drop NaN rows.
df = pd.read_csv("data.csv", header=None)
df = df.dropna()
print(df, "\n")

# Label rows by week number, and use this label as index.
df['WeekNumber'] = df[df[0] == "Week"].all(axis=1).cumsum().astype('category')
df = df.ffill()
df = df.set_index("WeekNumber")
print(df, "\n")

# Regroup the dataset by week number and reuse header in each group
header = list(df.iloc[0])
df = df.groupby("WeekNumber", observed=True,
                as_index=False).apply(lambda x: x[1:]).reset_index(level=0,
                                                                   drop=True)
df.columns = header
print(df, "\n")

# The name "Week" in the original dataset is somewhat inaccurate, so
# change the corresponding column
df = df.rename({"Week": "City"}, axis=1)
print(df, "\n")

# Example
print(df.groupby("WeekNumber", observed=True).agg({"Monday": "sum"}))

gives

                Monday
WeekNumber            
1               100200
2            150210260
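
Note that these sums are concatenated strings (100200 rather than 300), because every value read from the headerless CSV is text. A minimal sketch of one way to get numeric totals, assuming a cleaned frame shaped like the one above; the clean frame below is a hypothetical stand-in, not the answer's actual output:

```python
import pandas as pd

# Hypothetical stand-in for the cleaned frame: values are still strings.
clean = pd.DataFrame(
    {
        "City": ["City A", "City B", "City A", "City B", "City C"],
        "Monday": ["100", "200", "150", "210", "260"],
    },
    index=pd.Index([1, 1, 2, 2, 2], name="WeekNumber"),
)

# Convert the text column to numbers before aggregating.
clean["Monday"] = pd.to_numeric(clean["Monday"])

# Sum per week; grouping by the index level name works like a column name.
totals = clean.groupby("WeekNumber")["Monday"].sum()
print(totals)
```

With numeric values, week 1 sums to 300 and week 2 to 620.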
Answered By: JustLearning