pandas isn't recognising my datetime column
Question:
I exported this from a postgres table as a tab-separated csv, like so:
copy (select * from mytable) to 'labels.csv' csv DELIMITER E'\t' header
The head of the file looks like this:
user_id session_id start_time mode
2 715 2016-04-01 01:07:49+01 car
2 716 2016-04-01 03:09:53+01 car
2 1082 2016-04-02 13:05:16+01 car
2 1090 2016-04-02 15:16:32+01 car
I read this into pandas and tried to remove the timezone info like this:
df = pd.read_csv('labels.csv', sep='\t', parse_dates=['start_time'])
df['start_time'] = df['start_time'].dt.tz_localize(None)
But this gives the error:
AttributeError: Can only use .dt accessor with datetimelike values
df.head()
gives:
user_id session_id start_time mode
0 2 715 2016-04-01 01:07:49+01:00 car
1 2 716 2016-04-01 03:09:53+01:00 car
2 2 1082 2016-04-02 13:05:16+01:00 car
3 2 1090 2016-04-02 15:16:32+01:00 car
4 2 1601 2016-04-04 13:56:13+01:00 foot
However, df.info() reports start_time as object rather than a datetime dtype:
df.info()
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 5374 non-null int64
1 session_id 5374 non-null int64
2 start_time 5374 non-null object
3 transportation_mode 5374 non-null object
dtypes: int64(3), object(2)
Answers:
See the docs for pd.read_csv:
parse_dates: bool or list of int or names or list of lists or dict, default False
…
If a column or index cannot be represented as an array of datetimes, say because of an unparsable value or a mixture of timezones, the column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use pd.to_datetime after pd.read_csv. To parse an index or column with a mixture of timezones, specify date_parser to be a partially-applied pd.to_datetime with utc=True. See Parsing a CSV with mixed timezones for more.
You likely have an unparseable date in your data. Convert the column with pandas.to_datetime after reading the file. It raises on bad values by default, so the error will point you at the offending value:
df["start_time"] = pd.to_datetime(df["start_time"])
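If the exception alone does not make it obvious which rows are at fault, one option is to coerce instead of raising and then look at what failed. The snippet below is a minimal sketch on a made-up Series; the "--" placeholder and the sample timestamps are assumptions for illustration, not values from the asker's file.

```python
import pandas as pd

# Hypothetical column: two valid timestamps and one placeholder string
s = pd.Series(
    ["2016-04-01 01:07:49+01:00", "--", "2016-04-02 13:05:16+01:00"]
)

# errors="coerce" turns anything unparseable into NaT instead of raising
parsed = pd.to_datetime(s, errors="coerce", utc=True)

# Values that were present in the input but could not be parsed
bad_rows = s[parsed.isna() & s.notna()]
print(bad_rows)
```

The index of bad_rows tells you which rows of the original column to inspect.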
Once you identify the issue, you can then handle the value in your code. Something like:
# explicitly handle known invalid values
df["start_time"] = df["start_time"].replace({"--": pd.NaT})
df["start_time"] = pd.to_datetime(df["start_time"])
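The documentation quoted above names a second possible cause: a mixture of timezones. Whether that applies to this file is an assumption, but it can happen when timestamps span a daylight-saving change and carry different UTC offsets. A minimal sketch of that case, using hypothetical sample rows rather than the real data:

```python
import io
import pandas as pd

# Hypothetical sample: the two rows carry different UTC offsets
raw = (
    "user_id\tsession_id\tstart_time\tmode\n"
    "2\t500\t2016-03-20 10:00:00+00:00\tcar\n"
    "2\t715\t2016-04-01 01:07:49+01:00\tcar\n"
)

# Read without parse_dates, then convert explicitly
df = pd.read_csv(io.StringIO(raw), sep="\t")

# utc=True converts every value to UTC, giving a single tz-aware dtype
df["start_time"] = pd.to_datetime(df["start_time"], utc=True)

# The .dt accessor now works; this drops the timezone and keeps UTC times
df["start_time"] = df["start_time"].dt.tz_localize(None)
print(df.dtypes)
```

Note that the resulting naive values are UTC times. If you want local wall-clock times instead, call .dt.tz_convert with the zone you need before .dt.tz_localize(None).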