How to process data into 20-minute aggregates in Python
Question:
I have the following table:
TimeStamp | Name | Marks | Subject |
---|---|---|---|
2022-01-01 00:00:02.969 | Chris | 70 | DK |
2022-01-01 00:00:04.467 | Chris | 75 | DK |
2022-01-01 00:00:05.965 | Mark | 80 | DK |
2022-01-01 00:00:08.962 | Cuban | 60 | DK |
2022-01-01 00:00:10.461 | Cuban | 58 | DK |
I want to aggregate the table into 20-minute bins, with the min, max, and standard deviation of the marks for each name as separate columns.
Expected output:
TimeStamp | Subject | Chris_Min | Chris_Max | Chris_STD | Mark_Min | Mark_Max | Mark_STD |
---|---|---|---|---|---|---|---|
2022-01-01 00:00:00.000 | DK | 70 | 75 | | | | |
2022-01-01 00:20:00.000 | DK | etc | etc | | | | |
2022-01-01 00:40:00.000 | DK | etc | etc | | | | |
I am having a hard time aggregating the data into the required output.
The aggregation should be dynamic, so that the interval can be changed to 10 min or 30 min.
I tried using bins, but did not get the desired results.
Please help.
Answers:
Is your table a pandas DataFrame?
If it is, you can use resample:
# only if TimeStamp is not the index yet:
df = df.set_index('TimeStamp')
# the important part: pass agg a list of functions (callables, or strings for
# simple functions like 'mean'):
df = df.resample('10min').agg(['max', 'min'])
# only if you had to set the index to TimeStamp and want to go back to a normal index:
df = df.reset_index()
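To try this on the sample from the question, here is a minimal sketch. The way the DataFrame is built is an assumption about how the data is loaded; the one thing that matters is that `TimeStamp` has a datetime dtype, otherwise `resample` will not work. Note that this aggregates over all names together, not per name.

```python
import pandas as pd

# Sample rows from the question; TimeStamp must be parsed to datetime64.
df = pd.DataFrame(
    {
        "TimeStamp": pd.to_datetime(
            [
                "2022-01-01 00:00:02.969",
                "2022-01-01 00:00:04.467",
                "2022-01-01 00:00:05.965",
                "2022-01-01 00:00:08.962",
                "2022-01-01 00:00:10.461",
            ]
        ),
        "Name": ["Chris", "Chris", "Mark", "Cuban", "Cuban"],
        "Marks": [70, 75, 80, 60, 58],
        "Subject": ["DK"] * 5,
    }
)

# The bin width is a plain string, so it can be switched to "10min" or "30min".
rule = "20min"

# Min and max of Marks per time bin, across all names.
overall = df.set_index("TimeStamp")["Marks"].resample(rule).agg(["min", "max"])
print(overall)
```

All five sample rows fall into the first bin, so the result has a single row.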
Edit, to get the second table:
# choose aggregation functions
agg_functions = ['min', 'max', 'std']
# set_index on the time column, keep only the numeric column, resample
resampled_df = df.set_index('TimeStamp')[['Marks']].resample('10min').agg(agg_functions)
# flatten the column MultiIndex, e.g. ('Marks', 'min') -> 'Marks_min'
resampled_df.columns = resampled_df.columns.map('_'.join)
# drop the time index
resampled_df = resampled_df.reset_index(drop=True)
# concatenate with the original df (rows are aligned on the positional index)
pd.concat([df, resampled_df], axis=1)
You could try the following:
rule = "10min"
result = (
    df.set_index("TimeStamp").groupby(["Name", "Subject"])
    .resample(rule)
    .agg(Min=("Marks", "min"), Max=("Marks", "max"), STD=("Marks", "std"))
    .unstack(0)
    .swaplevel(0, 1).reset_index()
)
- First, set TimeStamp as the index and group by Name and Subject to get the right chunks to work on.
- Then .resample() the groups with the given frequency rule.
- Then aggregate the required stats using .agg() with named aggregation.
- Unstack the first index level (Name) to get it into the columns.
- Swap the remaining index levels to get the right order when finally resetting the index.
Result for the given sample:
   TimeStamp Subject   Min              Max                   STD
Name                 Chris Cuban Mark Chris Cuban Mark     Chris     Cuban Mark
0 2022-01-01      DK    70    58   80    75    60   80  3.535534  1.414214  NaN
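The steps above can be followed one at a time by keeping the intermediate results. This is only a sketch: the sample DataFrame is rebuilt here as an assumption about the input, and it passes a list of function names to `.agg()` on the `Marks` column instead of using named aggregation, so the stat columns come out as lowercase `min`, `max`, `std`.

```python
import pandas as pd

# Sample rows from the question (assumed input shape).
df = pd.DataFrame(
    {
        "TimeStamp": pd.to_datetime(
            [
                "2022-01-01 00:00:02.969",
                "2022-01-01 00:00:04.467",
                "2022-01-01 00:00:05.965",
                "2022-01-01 00:00:08.962",
                "2022-01-01 00:00:10.461",
            ]
        ),
        "Name": ["Chris", "Chris", "Mark", "Cuban", "Cuban"],
        "Marks": [70, 75, 80, 60, 58],
        "Subject": ["DK"] * 5,
    }
)
rule = "10min"

# Steps 1-3: time index, one group per (Name, Subject), resample, aggregate.
stats = (
    df.set_index("TimeStamp")
    .groupby(["Name", "Subject"])["Marks"]
    .resample(rule)
    .agg(["min", "max", "std"])
)
print(list(stats.index.names))  # index levels after aggregation

# Step 4: move Name (index level 0) into the columns.
wide = stats.unstack(0)
print(list(wide.index.names))  # remaining index levels

# Step 5: put TimeStamp before Subject, then turn the index into columns.
result = wide.swaplevel(0, 1).reset_index()
print(result)
```

Printing the index names after each step makes it easier to see why `unstack(0)` picks `Name` and why the swap is needed before `reset_index()`.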
If you want the columns exactly as in your expected output, you could add the following:
result = result[
    list(result.columns[:2]) + sorted(result.columns[2:], key=lambda c: c[1])
]
result.columns = [f"{lev1}_{lev0}" if lev1 else lev0 for lev0, lev1 in result.columns]
to get:
   TimeStamp Subject  Chris_Min  Chris_Max  ...  Cuban_STD  Mark_Min  Mark_Max  Mark_STD
0 2022-01-01      DK         70         75  ...   1.414214        80        80       NaN
If you get the error TypeError: aggregate() missing 1 required positional argument... (the comment that mentioned it is gone), you may be working with an older pandas version that does not support named aggregation here. You could try the following instead:
rule = "10min"
result = (
    df.set_index("TimeStamp").groupby(["Name", "Subject"])
    .resample(rule)
    .agg({"Marks": ["min", "max", "std"]})
    .droplevel(0, axis=1)
    .unstack(0)
    .swaplevel(0, 1).reset_index()
)
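Since the question asks for the interval to be easy to change, the whole thing could also be wrapped in a function that takes the rule as a parameter. The sketch below is a variant, not the code above: it bins with `pd.Grouper` inside a plain `groupby` instead of `groupby(...).resample(...)`, and the name `aggregate_marks` and the `STATS` mapping are made up for illustration.

```python
import pandas as pd

# Hypothetical mapping from pandas function names to the wanted column suffixes.
STATS = {"min": "Min", "max": "Max", "std": "STD"}


def aggregate_marks(df, rule="20min"):
    """Per-name min/max/std of Marks in time bins of width `rule`."""
    stats = (
        df.groupby(["Subject", "Name", pd.Grouper(key="TimeStamp", freq=rule)])["Marks"]
        .agg(list(STATS))
        .rename(columns=STATS)
    )
    # Name goes into the columns; swap so the column levels are (Name, stat).
    wide = stats.unstack("Name").swaplevel(0, 1, axis=1)
    # Order the columns name by name, each with Min, Max, STD.
    names = sorted(wide.columns.get_level_values(0).unique())
    wide = wide.reindex(columns=pd.MultiIndex.from_product([names, list(STATS.values())]))
    # Flatten ('Chris', 'Min') -> 'Chris_Min'.
    wide.columns = [f"{name}_{stat}" for name, stat in wide.columns]
    return wide.reorder_levels(["TimeStamp", "Subject"]).sort_index().reset_index()


# Sample rows from the question (assumed input shape).
df = pd.DataFrame(
    {
        "TimeStamp": pd.to_datetime(
            [
                "2022-01-01 00:00:02.969",
                "2022-01-01 00:00:04.467",
                "2022-01-01 00:00:05.965",
                "2022-01-01 00:00:08.962",
                "2022-01-01 00:00:10.461",
            ]
        ),
        "Name": ["Chris", "Chris", "Mark", "Cuban", "Cuban"],
        "Marks": [70, 75, 80, 60, 58],
        "Subject": ["DK"] * 5,
    }
)

print(aggregate_marks(df, "20min"))
print(aggregate_marks(df, "10min"))
```

Calling it with `"10min"`, `"20min"` or `"30min"` only changes the bin edges; the column layout stays the same.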
...