Pandas does not respect conversion to time type
Question:
I have this dataframe:
site date time
1 AA 2018-01-01 0100
2 AA 2018-01-01 0200
3 AA 2018-01-01 0300
4 AA 2018-01-01 0400
5 AA 2018-01-01 0500
6 AA 2018-01-01 0600
7 AA 2018-01-01 0700
8 AA 2018-01-01 0800
9 AA 2018-01-01 0900
df.dtypes
>>> site object
date datetime64[ns]
time object
I would like to convert the time column to a time type (without the date) so that I can later filter the dataframe between desired hours. So I did:
df['time'] = df['time'].apply(lambda x: pd.to_datetime(x, format='%H%M').time())
The dataframe now looks like this:
site date time
1 AA 2018-01-01 01:00:00
2 AA 2018-01-01 02:00:00
3 AA 2018-01-01 03:00:00
4 AA 2018-01-01 04:00:00
5 AA 2018-01-01 05:00:00
6 AA 2018-01-01 06:00:00
7 AA 2018-01-01 07:00:00
8 AA 2018-01-01 08:00:00
9 AA 2018-01-01 09:00:00
However, the data type is still an object type:
df.dtypes
>>> site object
date datetime64[ns]
time object
dtype: object
But when I check the type of an individual value, the conversion does seem to have worked:
df.at[5,'time']
>>> datetime.time(5, 0)
type(df.at[5,'time'])
>>> datetime.time
Still, I can’t filter the data based on time:
from datetime import time
df[df['time'].between_time(time(5),time(8))]
>>> TypeError: Index must be DatetimeIndex
Answers:
You see the TypeError because between_time filters on the index, not on the column values, and it requires that index to be a DatetimeIndex. The documentation for between_time says:
Raises
TypeError
If the index is not a DatetimeIndex
You need to set the dataframe's index to a DatetimeIndex, and for that the index data must contain full datetimes, not just times. By calling time() you convert each value into a datetime.time object, which pandas stores as object dtype, but a DatetimeIndex needs datetime values.
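To make the difference concrete, here is a minimal sketch. The frame is rebuilt from the question's sample, assuming the time column holds zero-padded strings such as "0500". It shows that a column of datetime.time values stays object dtype, while parsing the same strings with pd.to_datetime and no .time() call gives full timestamps, placed on the default date 1900-01-01:

```python
import pandas as pd

# Sample frame rebuilt from the question (assumption: time is a string column).
df = pd.DataFrame({
    "site": ["AA"] * 9,
    "date": pd.to_datetime(["2018-01-01"] * 9),
    "time": [f"{h:02d}00" for h in range(1, 10)],
})

# Converting to datetime.time objects: pandas keeps them as generic objects.
as_time = df["time"].apply(lambda x: pd.to_datetime(x, format="%H%M").time())
print(as_time.dtype)

# Parsing without .time(): a real datetime column, usable as a DatetimeIndex.
parsed = pd.to_datetime(df["time"], format="%H%M")
print(parsed.dtype)
print(parsed.iloc[0])  # the date part defaults to 1900-01-01
```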
One way to get the result you want, starting from the original string time column (before the .time() conversion), is:
df.set_index(pd.DatetimeIndex(
pd.to_datetime(df["time"], format="%H%M"))).between_time(
time(5), time(8)
).reset_index(drop=True)
Output:
site date time
0 AA 2018-01-01 0500
1 AA 2018-01-01 0600
2 AA 2018-01-01 0700
3 AA 2018-01-01 0800
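Or you could combine your date column with the time column to create the datetime index and then use between_time. Since date is already datetime64, format it as a string before concatenating:
df.set_index(pd.to_datetime(
df['date'].dt.strftime('%Y-%m-%d') + ' ' + df['time'],
format='%Y-%m-%d %H%M')).between_time(
time(5), time(8)).reset_index(drop=True)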
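For anyone who wants to run both approaches end to end, here is a self-contained sketch. It assumes the time column still holds the original zero-padded strings and rebuilds the sample frame from the question. The date is formatted with dt.strftime before concatenation, because adding a string directly to a datetime64 column raises a TypeError:

```python
from datetime import time

import pandas as pd

# Sample frame rebuilt from the question (assumption: time is a string column).
df = pd.DataFrame({
    "site": ["AA"] * 9,
    "date": pd.to_datetime(["2018-01-01"] * 9),
    "time": [f"{h:02d}00" for h in range(1, 10)],
})

# Approach 1: index built from the time strings only (date defaults to 1900-01-01).
time_index = pd.DatetimeIndex(pd.to_datetime(df["time"], format="%H%M"))
by_time = (
    df.set_index(time_index)
    .between_time(time(5), time(8))
    .reset_index(drop=True)
)

# Approach 2: index built from the real date plus the time string.
stamps = pd.to_datetime(
    df["date"].dt.strftime("%Y-%m-%d") + " " + df["time"],
    format="%Y-%m-%d %H%M",
)
by_stamp = (
    df.set_index(pd.DatetimeIndex(stamps))
    .between_time(time(5), time(8))
    .reset_index(drop=True)
)

print(by_stamp)
```

Both versions keep the same four rows, because between_time only looks at the time-of-day part of the index and includes both endpoints by default.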