Pandas Dataframe – Droping Certain Hours of the Day from 20 Years of Historical Data
Question:
I have stock market data for a single security going back 20 years. The data is currently in an Pandas DataFrame, in the following format:
The problem is, I do not want any “after hours” trading data in my DataFrame. The market in question is open from 9:30AM to 4PM (09:30 to 16:00 on each trading day). I would like to drop all rows of data that are not within this time frame.
My instinct is to use a Pandas mask, which I know how to do if I wanted certain hours in a single day:
mask = (df['date'] > '2015-07-06 09:30:0') & (df['date'] <= '2015-07-06 16:00:0')
sub = df.loc[mask]
However, I have no idea how to use one on a revolving basis to remove the data for certain times of day over a 20 year period.
Answers:
Problem here is how you are importing data. There is no indicator whether 04:00 is am or pm? but based on your comments we need to assume it is PM. However input is showing it as AM.
To solve this we need to include two conditions with OR clause.
- 9:30-11:59
- 0:00-4:00
Input:
df = pd.DataFrame({'date': {880551: '2015-07-06 04:00:00', 880552: '2015-07-06 04:02:00',880553: '2015-07-06 04:03:00', 880554: '2015-07-06 04:04:00', 880555: '2015-07-06 04:05:00'},
'open': {880551: 125.00, 880552: 125.36,880553: 125.34, 880554: 125.08, 880555: 125.12},
'high': {880551: 125.00, 880552: 125.36,880553: 125.34, 880554: 125.11, 880555: 125.12},
'low': {880551: 125.00, 880552: 125.32,880553: 125.21, 880554: 125.05, 880555: 125.12},
'close': {880551: 125.00, 880552: 125.32,880553: 125.21, 880554: 125.05, 880555: 125.12},
'volume': {880551: 141, 880552: 200,880553: 750, 880554: 17451, 880555: 1000},
},
)
df.head()
date open high low close volume
880551 2015-07-06 04:00:00 125.00 125.00 125.00 125.00 141
880552 2015-07-06 04:02:00 125.36 125.36 125.32 125.32 200
880553 2015-07-06 04:03:00 125.34 125.34 125.21 125.21 750
880554 2015-07-06 04:04:00 125.08 125.11 125.05 125.05 17451
880555 2015-07-06 04:05:00 125.12 125.12 125.12 125.12 1000
from datetime import time
start_first = time(9, 30)
end_first = time(11, 59)
start_second = time(0, 00)
end_second = time(4,00)
df['date'] = pd.to_datetime(df['date'])
df= df[(df['date'].dt.time.between(start_first, end_first)) | (df['date'].dt.time.between(start_second, end_second))]
df
date open high low close volume
880551 2015-07-06 04:00:00 125.0 125.0 125.0 125.0 141
Above is not good practice, and I strongly discourage to use this kind of ambiguous data. long time solution is to correctly populate data with am/pm.
We can achieve it in two way in case of correct data format:
1) using datetime
from datetime import time
start = time(9, 30)
end = time(16)
df['date'] = pd.to_datetime(df['date'])
df= df[df['date'].dt.time.between(start, end)]
2) using between time, which only works with datetime index
df['date'] = pd.to_datetime(df['date'])
df = (df.set_index('date')
.between_time('09:30', '16:00')
.reset_index())
If you still face error, edit your question with line by line approach and exact error.
I think the answer is already in the comments (@Parfait’s .between_time) but that it got lost in debugging issues. It appears your df['date']
column is not of type Datetime
yet.
This should be enough to fix that and get the required result:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
df = df.between_time('9:30', '16:00')
This example code consolidates the answers provided by Bhavesh Ghodasara, Parfait and jorijnsmit into one complete, commented example:
import pandas as pd
# example dataframe containing 6 records: 2 days of 3 records each in which all cases are covered:
# each day has one record before trading hours, one record during trading hours and one recrod after trading hours
df = pd.DataFrame({'date': {0: '2015-07-06 08:00:00', 1: '2015-07-06 13:00:00', 2: '2015-07-06 18:00:00',
3: '2015-07-07 08:00:00', 4: '2015-07-07 13:00:00', 5: '2015-07-07 18:00:00'},
'open': {0: 125.00, 1: 125.36, 2: 125.34, 3: 125.08, 4: 125.12, 5: 125.37},
'high': {0: 125.00, 1: 125.36, 2: 125.34, 3: 125.08, 4: 125.12, 5: 125.37},
'low': {0: 125.00, 1: 125.36, 2: 125.34, 3: 125.08, 4: 125.12, 5: 125.37},
'close': {0: 125.00, 1: 125.36, 2: 125.34, 3: 125.08, 4: 125.12, 5: 125.37},
'volume': {0: 141, 1: 200, 2: 750, 3: 17451, 4: 1000, 5: 38234},
},
)
# inspect the example data set
df.head(6)
# first, ensure that the 'date' column is of the correct data type: MAKE IT SO!
df['date'] = pd.to_datetime(df['date'])
# inspect the data types: date column should be of type 'datetime64[ns]'
print(df.dtypes)
# set the index of the dataframe to the datetime-type column 'data'
df = df.set_index('date')
# inspect the index: it should be a DatetimeIndex of dtype 'datetime64[ns]'
print(df.index)
# filter the data set
df_filtered = df.between_time('9:30', '16:00')
# inspect the filtered data set: VoilĂ ! No more outside trading hours records.
df_filtered.head()
All of the previous answers are ignoring one important fact – Daylight saving.
Assuming your data is in UTC time zone, the opening and closing hours NYSE are different depending on DST.
Just filtering your data with df.between_time("09:30","16:30")
is wrong. You should be aware of the NYSE’s schedule on any given day.
Fortunately, The pip package pandas_market_calendars
is making this much easier to handle.
import pandas_market_calendars as mcal
nyse = mcal.get_calendar('NYSE')
nyse.schedule(start_date='2022-03-10', end_date='2022-03-20')
This will result in
2022-03-10 2022-03-10 14:30:00+00:00 2022-03-10 21:00:00+00:00
2022-03-11 2022-03-11 14:30:00+00:00 2022-03-11 21:00:00+00:00
2022-03-14 2022-03-14 13:30:00+00:00 2022-03-14 20:00:00+00:00
2022-03-15 2022-03-15 13:30:00+00:00 2022-03-15 20:00:00+00:00
2022-03-16 2022-03-16 13:30:00+00:00 2022-03-16 20:00:00+00:00
2022-03-17 2022-03-17 13:30:00+00:00 2022-03-17 20:00:00+00:00
2022-03-18 2022-03-18 13:30:00+00:00 2022-03-18 20:00:00+00:00
You can use this output to create one index that contains all minutes between market_open
and market_close
of each day.
Note: This piece of code for sure can be done better, but it still runs pretty fast.
hours = []
for i, row in nyse_scehdule.iterrows():
hours.append(pd.date_range(start=row['market_open'], end=row['market_close'], tz="UTC", freq="1min").to_series())
hours_index = pd.concat(hours).index
Now you can just reindex your original dataframe by this new index:
data.reindex(hours_index)
Hope this helps.
I have stock market data for a single security going back 20 years. The data is currently in an Pandas DataFrame, in the following format:
The problem is, I do not want any “after hours” trading data in my DataFrame. The market in question is open from 9:30AM to 4PM (09:30 to 16:00 on each trading day). I would like to drop all rows of data that are not within this time frame.
My instinct is to use a Pandas mask, which I know how to do if I wanted certain hours in a single day:
mask = (df['date'] > '2015-07-06 09:30:0') & (df['date'] <= '2015-07-06 16:00:0')
sub = df.loc[mask]
However, I have no idea how to use one on a revolving basis to remove the data for certain times of day over a 20 year period.
Problem here is how you are importing data. There is no indicator whether 04:00 is am or pm? but based on your comments we need to assume it is PM. However input is showing it as AM.
To solve this we need to include two conditions with OR clause.
- 9:30-11:59
- 0:00-4:00
Input:
df = pd.DataFrame({'date': {880551: '2015-07-06 04:00:00', 880552: '2015-07-06 04:02:00',880553: '2015-07-06 04:03:00', 880554: '2015-07-06 04:04:00', 880555: '2015-07-06 04:05:00'},
'open': {880551: 125.00, 880552: 125.36,880553: 125.34, 880554: 125.08, 880555: 125.12},
'high': {880551: 125.00, 880552: 125.36,880553: 125.34, 880554: 125.11, 880555: 125.12},
'low': {880551: 125.00, 880552: 125.32,880553: 125.21, 880554: 125.05, 880555: 125.12},
'close': {880551: 125.00, 880552: 125.32,880553: 125.21, 880554: 125.05, 880555: 125.12},
'volume': {880551: 141, 880552: 200,880553: 750, 880554: 17451, 880555: 1000},
},
)
df.head()
date open high low close volume
880551 2015-07-06 04:00:00 125.00 125.00 125.00 125.00 141
880552 2015-07-06 04:02:00 125.36 125.36 125.32 125.32 200
880553 2015-07-06 04:03:00 125.34 125.34 125.21 125.21 750
880554 2015-07-06 04:04:00 125.08 125.11 125.05 125.05 17451
880555 2015-07-06 04:05:00 125.12 125.12 125.12 125.12 1000
from datetime import time
start_first = time(9, 30)
end_first = time(11, 59)
start_second = time(0, 00)
end_second = time(4,00)
df['date'] = pd.to_datetime(df['date'])
df= df[(df['date'].dt.time.between(start_first, end_first)) | (df['date'].dt.time.between(start_second, end_second))]
df
date open high low close volume
880551 2015-07-06 04:00:00 125.0 125.0 125.0 125.0 141
Above is not good practice, and I strongly discourage to use this kind of ambiguous data. long time solution is to correctly populate data with am/pm.
We can achieve it in two way in case of correct data format:
1) using datetime
from datetime import time
start = time(9, 30)
end = time(16)
df['date'] = pd.to_datetime(df['date'])
df= df[df['date'].dt.time.between(start, end)]
2) using between time, which only works with datetime index
df['date'] = pd.to_datetime(df['date'])
df = (df.set_index('date')
.between_time('09:30', '16:00')
.reset_index())
If you still face error, edit your question with line by line approach and exact error.
I think the answer is already in the comments (@Parfait’s .between_time) but that it got lost in debugging issues. It appears your df['date']
column is not of type Datetime
yet.
This should be enough to fix that and get the required result:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
df = df.between_time('9:30', '16:00')
This example code consolidates the answers provided by Bhavesh Ghodasara, Parfait and jorijnsmit into one complete, commented example:
import pandas as pd
# example dataframe containing 6 records: 2 days of 3 records each in which all cases are covered:
# each day has one record before trading hours, one record during trading hours and one recrod after trading hours
df = pd.DataFrame({'date': {0: '2015-07-06 08:00:00', 1: '2015-07-06 13:00:00', 2: '2015-07-06 18:00:00',
3: '2015-07-07 08:00:00', 4: '2015-07-07 13:00:00', 5: '2015-07-07 18:00:00'},
'open': {0: 125.00, 1: 125.36, 2: 125.34, 3: 125.08, 4: 125.12, 5: 125.37},
'high': {0: 125.00, 1: 125.36, 2: 125.34, 3: 125.08, 4: 125.12, 5: 125.37},
'low': {0: 125.00, 1: 125.36, 2: 125.34, 3: 125.08, 4: 125.12, 5: 125.37},
'close': {0: 125.00, 1: 125.36, 2: 125.34, 3: 125.08, 4: 125.12, 5: 125.37},
'volume': {0: 141, 1: 200, 2: 750, 3: 17451, 4: 1000, 5: 38234},
},
)
# inspect the example data set
df.head(6)
# first, ensure that the 'date' column is of the correct data type: MAKE IT SO!
df['date'] = pd.to_datetime(df['date'])
# inspect the data types: date column should be of type 'datetime64[ns]'
print(df.dtypes)
# set the index of the dataframe to the datetime-type column 'data'
df = df.set_index('date')
# inspect the index: it should be a DatetimeIndex of dtype 'datetime64[ns]'
print(df.index)
# filter the data set
df_filtered = df.between_time('9:30', '16:00')
# inspect the filtered data set: VoilĂ ! No more outside trading hours records.
df_filtered.head()
All of the previous answers are ignoring one important fact – Daylight saving.
Assuming your data is in UTC time zone, the opening and closing hours NYSE are different depending on DST.
Just filtering your data with df.between_time("09:30","16:30")
is wrong. You should be aware of the NYSE’s schedule on any given day.
Fortunately, The pip package pandas_market_calendars
is making this much easier to handle.
import pandas_market_calendars as mcal
nyse = mcal.get_calendar('NYSE')
nyse.schedule(start_date='2022-03-10', end_date='2022-03-20')
This will result in
2022-03-10 2022-03-10 14:30:00+00:00 2022-03-10 21:00:00+00:00
2022-03-11 2022-03-11 14:30:00+00:00 2022-03-11 21:00:00+00:00
2022-03-14 2022-03-14 13:30:00+00:00 2022-03-14 20:00:00+00:00
2022-03-15 2022-03-15 13:30:00+00:00 2022-03-15 20:00:00+00:00
2022-03-16 2022-03-16 13:30:00+00:00 2022-03-16 20:00:00+00:00
2022-03-17 2022-03-17 13:30:00+00:00 2022-03-17 20:00:00+00:00
2022-03-18 2022-03-18 13:30:00+00:00 2022-03-18 20:00:00+00:00
You can use this output to create one index that contains all minutes between market_open
and market_close
of each day.
Note: This piece of code for sure can be done better, but it still runs pretty fast.
hours = []
for i, row in nyse_scehdule.iterrows():
hours.append(pd.date_range(start=row['market_open'], end=row['market_close'], tz="UTC", freq="1min").to_series())
hours_index = pd.concat(hours).index
Now you can just reindex your original dataframe by this new index:
data.reindex(hours_index)
Hope this helps.