Resampling timeseries dataframe with multi-index
Question:
Generate data:
import pandas as pd
import numpy as np
# FREQ was not defined in the original post; 10 (minutes) is an assumption
# that is consistent with the 194-row output shown in the answer.
FREQ = 10
df = pd.DataFrame(index=pd.date_range(freq=f'{FREQ}min', start='2020-10-01', periods=12 * 24))
df['col1'] = np.random.normal(size=df.shape[0])
# np.random.random_integers is deprecated; randint excludes the upper bound
df['col2'] = np.random.randint(1, 101, size=df.shape[0])
df['uid'] = 1
df2 = pd.DataFrame(index=pd.date_range(freq=f'{FREQ}min', start='2020-10-01', periods=12 * 24))
df2['col1'] = np.random.normal(size=df2.shape[0])
df2['col2'] = np.random.randint(1, 51, size=df2.shape[0])
df2['uid'] = 2
# Stack both users, then index by (timestamp, uid)
df3 = pd.concat([df, df2]).reset_index()
df3 = df3.set_index(['index', 'uid'])
I am trying to resample the data to 30-minute intervals and specify how each column is aggregated, separately for each uid. I have many columns, and for each one I need to choose whether I want the mean, median, std, max, or min. Since the timestamps are duplicated across users, the operation has to be done per user, which is why I set a MultiIndex and tried the following:
df3.groupby(pd.Grouper(freq='30Min',closed='right',label='right')).agg({
"col1": "max", "col2": "min", 'uid':'max'})
but I get the following error:
ValueError: MultiIndex has no single backing array. Use 'MultiIndex.to_numpy()' to get a NumPy array of tuples.
How can I do this operation?
Answers:
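When you use pd.Grouper on a MultiIndex, you have to specify which index level holds the timestamps by passing level='index'. Adding 'uid' as a second grouping key aggregates each user separately: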
out = (df3.groupby([pd.Grouper(level='index', freq='30min', closed='right', label='right'), 'uid'])
          .agg({"col1": "max", "col2": "min"}))
print(out)
# Output
                             col1  col2
index               uid
2020-10-01 00:00:00 1   -0.222489    77
                    2   -1.490019    22
2020-10-01 00:30:00 1    1.556801    16
                    2    0.580076     1
2020-10-01 01:00:00 1    0.745477    12
...                           ...   ...
2020-10-02 23:00:00 2    0.272276    13
2020-10-02 23:30:00 1    0.378779    20
                    2    0.786048     5
2020-10-03 00:00:00 1    1.716791    20
                    2    1.438454     5
[194 rows x 2 columns]
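The question also mentions needing a different statistic (mean, median, std, max, min) for each of many columns. The same grouping can take a dict that maps each column to a list of aggregations. The following is only a minimal sketch: FREQ = 10, the seed, the make_user helper, the agg_spec dict, and the flattened column names are illustrative assumptions and not part of the original post.

```python
import numpy as np
import pandas as pd

FREQ = 10  # assumed sampling interval in minutes (undefined in the question)
rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable


def make_user(uid, high):
    """Hypothetical helper: one user's frame on a regular timestamp index."""
    idx = pd.date_range("2020-10-01", periods=12 * 24, freq=f"{FREQ}min", name="index")
    return pd.DataFrame(
        {
            "col1": rng.normal(size=len(idx)),
            "col2": rng.integers(1, high + 1, size=len(idx)),  # inclusive of high
            "uid": uid,
        },
        index=idx,
    )


# Stack both users and move uid into the index: levels are ('index', 'uid')
df3 = pd.concat([make_user(1, 100), make_user(2, 50)]).set_index("uid", append=True)

# One entry per column; each column may get several statistics
agg_spec = {
    "col1": ["mean", "std", "max"],
    "col2": ["median", "min"],
}

out = df3.groupby(
    [pd.Grouper(level="index", freq="30min", closed="right", label="right"), "uid"]
).agg(agg_spec)

# Flatten the two-level column index, e.g. ('col1', 'mean') becomes 'col1_mean'
out.columns = ["_".join(col) for col in out.columns]
print(out.head())
```

With closed='right', the first bin of each user contains a single observation, so its std is NaN. Columns left out of agg_spec are dropped from the result.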