Python Pandas: Split a time series per month or week
Question:
I have a time series that spans a few years, in the following format:
timestamp open high low close volume
0 2009-01-02 05:00:00 900.00 906.75 898.00 904.75 15673.0
1 2009-01-02 05:30:00 904.75 907.75 903.75 905.50 4600.0
2 2009-01-02 06:00:00 905.50 907.25 904.50 904.50 3472.0
3 2009-01-02 06:30:00 904.50 905.00 903.25 904.75 6074.0
4 2009-01-02 07:00:00 904.75 905.50 897.00 898.25 12538.0
What would be the simplest way to split that dataframe into multiple dataframes of 1 week or 1 month worth of data?
As an example, a dataframe containing 1 year of data would be split in 52 dataframes containing a week of data and returned as a list of 52 dataframes.
The data can be reconstructed with the code below:
import pandas as pd
from pandas import Timestamp
dikt={'close': {0: 904.75, 1: 905.5, 2: 904.5, 3: 904.75, 4: 898.25}, 'low': {0: 898.0, 1: 903.75, 2: 904.5, 3: 903.25, 4: 897.0}, 'open': {0: 900.0, 1: 904.75, 2: 905.5, 3: 904.5, 4: 904.75}, 'high': {0: 906.75, 1: 907.75, 2: 907.25, 3: 905.0, 4: 905.5}, 'volume': {0: 15673.0, 1: 4600.0, 2: 3472.0, 3: 6074.0, 4: 12538.0}, 'timestamp': {0: Timestamp('2009-01-02 05:00:00'), 1: Timestamp('2009-01-02 05:30:00'), 2: Timestamp('2009-01-02 06:00:00'), 3: Timestamp('2009-01-02 06:30:00'), 4: Timestamp('2009-01-02 07:00:00')}}
df = pd.DataFrame(dikt, columns=['timestamp', 'open', 'high', 'low', 'close', 'volume'])
Answers:
Convert the timestamp column into a DatetimeIndex; then you can slice into it in a variety of ways.
I would use groupby for this. Assuming df stores the data:
df = df.set_index('timestamp')
df.groupby(pd.TimeGrouper(freq='D'))
The resulting groups then contain all the dataframes you are looking for.
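Note that pd.TimeGrouper has since been removed from pandas; a minimal sketch of the same daily split with its replacement, pd.Grouper, using a few rows shaped like the question's data:

```python
import pandas as pd
from pandas import Timestamp

# A few rows shaped like the question's data, spanning two calendar days
df = pd.DataFrame({
    'timestamp': [Timestamp('2009-01-02 05:00:00'),
                  Timestamp('2009-01-02 05:30:00'),
                  Timestamp('2009-01-03 06:00:00')],
    'close': [904.75, 905.50, 904.50],
})

# pd.Grouper(key=...) bins on the column directly, no set_index needed
days = [g for _, g in df.groupby(pd.Grouper(key='timestamp', freq='D'))]
print(len(days))  # 2 calendar days -> 2 dataframes
```

Each element of days is a dataframe holding one day's rows, just as the TimeGrouper version produced.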
Use groupby with pd.TimeGrouper and list comprehensions:
weeks = [g for n, g in df.set_index('timestamp').groupby(pd.TimeGrouper('W'))]
months = [g for n, g in df.set_index('timestamp').groupby(pd.TimeGrouper('M'))]
You can reset the index if you need to:
weeks = [g.reset_index()
for n, g in df.set_index('timestamp').groupby(pd.TimeGrouper('W'))]
months = [g.reset_index()
for n, g in df.set_index('timestamp').groupby(pd.TimeGrouper('M'))]
Or collect the groups in a dict:
weeks = {n: g.reset_index()
for n, g in df.set_index('timestamp').groupby(pd.TimeGrouper('W'))}
months = {n: g.reset_index()
for n, g in df.set_index('timestamp').groupby(pd.TimeGrouper('M'))}
pd.TimeGrouper is deprecated and will be removed; you can use pd.Grouper instead.
weeks = [g for n, g in df.groupby(pd.Grouper(key='timestamp',freq='W'))]
months = [g for n, g in df.groupby(pd.Grouper(key='timestamp',freq='M'))]
This way you can also avoid setting timestamp as the index.
Also, if your timestamp is part of a MultiIndex, you can refer to it using the level parameter (e.g. pd.Grouper(level='timestamp', freq='W')). Thanks @jtromans for the heads up.
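A minimal sketch of the level-based form, assuming a two-level index where 'timestamp' is one level (the symbol names here are purely illustrative):

```python
import pandas as pd

# Illustrative two-level index: ('symbol', 'timestamp')
idx = pd.MultiIndex.from_tuples(
    [('AAPL', pd.Timestamp('2009-01-02')),
     ('MSFT', pd.Timestamp('2009-01-02')),
     ('AAPL', pd.Timestamp('2009-01-09'))],
    names=['symbol', 'timestamp'])
df2 = pd.DataFrame({'close': [904.75, 20.00, 910.00]}, index=idx)

# Group on the 'timestamp' level of the MultiIndex in weekly bins
weeks = {n: g for n, g in df2.groupby(pd.Grouper(level='timestamp', freq='W'))}
print(sorted(weeks))  # week-end labels for the two weeks
```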
The concept of TimeGrouper is correct, but the syntax no longer works with the latest versions of pandas. Here is my working code on pandas 1.1.3:
df_Time = df.copy()
df_Time = df_Time.groupby(pd.Grouper(key='time', freq='M')).agg({
'polarity': 'mean',
})
pd.Grouper(key='time', freq='M') is what you need. key is the column holding the time/timestamp, and freq can take multiple values with very useful options. The full list of offset aliases (frequency options) can be found in the pandas documentation.
The main ones are:
B: business day frequency
C: custom business day frequency
D: calendar day frequency
W: weekly frequency
M: month end frequency
This should fix it.
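As a quick illustration of the 'M' alias (month end frequency), a small sketch with made-up 'time'/'polarity' data like the snippet above; note that each group is labeled with its month-end date:

```python
import pandas as pd

# Made-up data in the same shape as the 'time'/'polarity' example
s = pd.DataFrame({
    'time': pd.to_datetime(['2022-01-05', '2022-01-20', '2022-02-03']),
    'polarity': [0.1, 0.3, 0.5],
})

# 'M' bins by calendar month and labels each group with the month-end date
monthly = s.groupby(pd.Grouper(key='time', freq='M')).agg({'polarity': 'mean'})
print(monthly.index.tolist())  # month-end labels for January and February
```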
Load your data and parse the dates:
import pandas as pd
data = pd.read_csv("../Data/2022/2022_02.csv",
                   delimiter=',', parse_dates=["Timestamp"])
You can add date_parser=pd.to_datetime to parse the dates as datetime values.
weeks = [week for stamp, week in data.resample("W", on="Timestamp")]
months = [month for stamp, month in data.resample("M", on="Timestamp")]
Note that resample needs a DatetimeIndex; since Timestamp is a regular column here, the on="Timestamp" argument points resample at it.
In the weeks list, each item is a pandas dataframe (likewise for months). You can inspect one with weeks[0].
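One practical caveat: resample (and pd.Grouper) emits a bin for every period in the range, so weeks with no rows come back as empty dataframes. A small sketch with made-up data showing how to filter them out:

```python
import pandas as pd

# Made-up data with a gap between Jan 4 and Jan 20
data = pd.DataFrame(
    {'value': [1.0, 2.0, 3.0]},
    index=pd.to_datetime(['2022-01-03', '2022-01-04', '2022-01-20']))

# The gap produces an empty weekly bin in the middle
all_weeks = [w for _, w in data.resample("W")]
nonempty = [w for w in all_weeks if not w.empty]
print(len(all_weeks), len(nonempty))  # 3 bins, 2 with data
```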