Iterate through chunks of a pandas DataFrame
Question:
I have a pandas.DataFrame that looks like the following:
Week | Monday | Tuesday | Wednesday | Thursday | Friday
---|---|---|---|---|---
City A | 100 | 300 | x | z | w
City B | 200 | 400 | y | q | p
None | None | None | None | None | None
Week | Monday | Tuesday | Wednesday | Thursday | Friday
City A | 150 | 320 | a | c | e
City B | 210 | 470 | z | t | q
City C | 260 | 446 | b | d | f
None | None | None | None | None | None
This repeats until all weeks in a year are covered (it’s basically a weekly calendar with data in it).
I wish to loop through the DataFrame in chunks, and do some operations with the data within those chunks.
The chunks should be basically "Week-to-Week"-high and "Week-to-Friday"-wide, if that makes sense. However, as you can see, the chunks are not equally large, so I can't hard-code the size to be 4×6, for example. They do, however, always run from one "Week" row to the next and extend as far right as "Friday".
Is there any intuitive way I can iterate through my DataFrame? Any help is appreciated.
Answers:
You can try:
df['week_index'] = df.isna().all(axis='columns').astype(int).cumsum()

for _, df_chunk in df.groupby('week_index'):
    # do something
To make each chunk start at a "Week" header row instead (week-to-week), shift the marker down by one row so each all-NaN separator stays with the block it closes:

df['week_index'] = df.isna().all(axis='columns').astype(int).shift(1, fill_value=0).cumsum()

for _, df_chunk in df.groupby('week_index'):
    # process each chunk
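For instance, here is a minimal, self-contained sketch of the week-to-week variant, using a small hand-built frame that stands in for your data, to show what each chunk contains:

import pandas as pd
import numpy as np

# Small stand-in for the weekly calendar layout described in the question.
rows = [
    ["Week", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
    ["City A", "100", "300", "x", "z", "w"],
    ["City B", "200", "400", "y", "q", "p"],
    [np.nan] * 6,
    ["Week", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
    ["City A", "150", "320", "a", "c", "e"],
    ["City B", "210", "470", "z", "t", "q"],
    ["City C", "260", "446", "b", "d", "f"],
    [np.nan] * 6,
]
df = pd.DataFrame(rows)

# All-NaN separator rows mark the end of a week; shift(1) keeps each
# separator with the block it closes, so every chunk starts at a "Week" row.
df['week_index'] = (
    df.isna().all(axis='columns').astype(int).shift(1, fill_value=0).cumsum()
)

for week_index, df_chunk in df.groupby('week_index'):
    # Drop the separator row and the helper column before using the chunk.
    df_chunk = df_chunk.dropna(subset=[0]).drop(columns='week_index')
    print(f"Chunk {week_index}:")
    print(df_chunk, "\n")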
Reproducing your data with the following CSV file,
# data.csv
Week,Monday,Tuesday,Wednesday,Thursday,Friday
City A,100,300,x,z,w
City B,200,400,y,q,p
None,None,None,None,None,None
Week,Monday,Tuesday,Wednesday,Thursday,Friday
City A,150,320,a,c,e
City B,210,470,z,t,q
City C,260,446,b,d,f
None,None,None,None,None,None
you can do the following to clean the original dataset and turn it into something more useful for grouping and aggregation:
import pandas as pd

# Reproduce your data, then drop the all-NaN separator rows.
df = pd.read_csv("data.csv", header=None)
df = df.dropna()
print(df, "\n")

# Label rows by week number, and use this label as index.
df['WeekNumber'] = df[df[0] == "Week"].all(axis=1).cumsum().astype('category')
df = df.ffill()
df = df.set_index("WeekNumber")
print(df, "\n")

# Regroup the dataset by week number and reuse the header row in each group.
header = list(df.iloc[0])
df = (df.groupby("WeekNumber", observed=True, as_index=False)
        .apply(lambda x: x[1:])
        .reset_index(level=0, drop=True))
df.columns = header
print(df, "\n")

# The name "Week" in the original dataset is somewhat inaccurate, so
# rename the corresponding column.
df = df.rename({"Week": "City"}, axis=1)
print(df, "\n")

# Example aggregation.
print(df.groupby("WeekNumber", observed=True).agg({"Monday": "sum"}))
gives
                Monday
WeekNumber
1               100200
2            150210260
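Note that the weekday columns were read in as strings, so "sum" concatenates them ("100" + "200" gives "100200") rather than adding. If you want arithmetic totals instead, a minimal follow-up sketch (assuming Monday and Tuesday are the columns that actually hold numbers) is to coerce them with pd.to_numeric first:

# Assumption: only Monday and Tuesday contain numeric data in this example.
numeric_cols = ["Monday", "Tuesday"]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(df.groupby("WeekNumber", observed=True).agg({"Monday": "sum"}))
# Now yields 300 and 620 for weeks 1 and 2 instead of concatenated strings.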