Python pandas integer YYYYMMDD to datetime
Question:
I have a DataFrame that looks like the following:
OrdNo LstInvDt
9 20070620
11 20070830
19 20070719
21 20070719
23 20070719
26 20070911
29 20070918
31 0070816
34 20070925
LstInvDt is of dtype int64. As you can see, the integers are of the format YYYYMMDD, e.g. 20070530 is the 30th of May 2007. I have tried a range of approaches, the most obvious being
pd.to_datetime(dt['Date'])
and
pd.to_datetime(str(dt['Date']))
with multiple variations on the functions' different parameters.
The result has been that the date is interpreted as being the time. The date is set to 1970-01-01; the outcome for the example above is 1970-01-01 00:00:00.020070530.
I also tried various .map() functions found in similar posts.
How do I convert it correctly?
Answers:
to_datetime accepts a format string:
In [92]:
t = 20070530
pd.to_datetime(str(t), format='%Y%m%d')
Out[92]:
Timestamp('2007-05-30 00:00:00')
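As background for the symptom in the question: when no format is given, to_datetime treats a bare integer as a count of units since the Unix epoch, nanoseconds by default, which would explain the 1970-01-01 00:00:00.020070530 result. A minimal sketch of the contrast, reusing the value from the example above (the variable names as_epoch and as_date are only illustrative):

```python
import pandas as pd

t = 20070530

# No format: the integer is read as nanoseconds since 1970-01-01.
as_epoch = pd.to_datetime(t)

# With a format: the digits are read as year, month, day.
as_date = pd.to_datetime(str(t), format='%Y%m%d')

print(as_epoch)  # 1970-01-01 00:00:00.020070530
print(as_date)   # 2007-05-30 00:00:00
```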
Example with a DataFrame:
In [94]:
t = 20070530
df = pd.DataFrame({'date':[t]*10})
df
Out[94]:
date
0 20070530
1 20070530
2 20070530
3 20070530
4 20070530
5 20070530
6 20070530
7 20070530
8 20070530
9 20070530
In [98]:
df['DateTime'] = df['date'].apply(lambda x: pd.to_datetime(str(x), format='%Y%m%d'))
df
Out[98]:
date DateTime
0 20070530 2007-05-30
1 20070530 2007-05-30
2 20070530 2007-05-30
3 20070530 2007-05-30
4 20070530 2007-05-30
5 20070530 2007-05-30
6 20070530 2007-05-30
7 20070530 2007-05-30
8 20070530 2007-05-30
9 20070530 2007-05-30
In [99]:
df.dtypes
Out[99]:
date int64
DateTime datetime64[ns]
dtype: object
EDIT
Actually it’s quicker to convert the type to string and then convert the entire series to a datetime rather than calling apply on every value:
In [102]:
df['DateTime'] = pd.to_datetime(df['date'].astype(str), format='%Y%m%d')
df
Out[102]:
date DateTime
0 20070530 2007-05-30
1 20070530 2007-05-30
2 20070530 2007-05-30
3 20070530 2007-05-30
4 20070530 2007-05-30
5 20070530 2007-05-30
6 20070530 2007-05-30
7 20070530 2007-05-30
8 20070530 2007-05-30
9 20070530 2007-05-30
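The same whole-series conversion can be sketched against the column names from the question. This assumes a frame with OrdNo and LstInvDt columns; only a few of the question's rows are copied here, so it is an illustration rather than the asker's actual data:

```python
import pandas as pd

# A few rows copied from the question's table.
df = pd.DataFrame({
    'OrdNo': [9, 11, 19],
    'LstInvDt': [20070620, 20070830, 20070719],
})

# Cast the int64 column to strings, then parse the whole Series at once.
df['LstInvDt'] = pd.to_datetime(df['LstInvDt'].astype(str), format='%Y%m%d')

print(df.dtypes)
```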
Timings:
In [104]:
%timeit df['date'].apply(lambda x: pd.to_datetime(str(x), format='%Y%m%d'))
100 loops, best of 3: 2.55 ms per loop
In [105]:
%timeit pd.to_datetime(df['date'].astype(str), format='%Y%m%d')
1000 loops, best of 3: 396 µs per loop
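%timeit is an IPython magic, so the comparison above only runs inside IPython or Jupyter. As a rough sketch, the same comparison could be reproduced in a plain script with the standard-library timeit module. The helper names per_value and whole_series are made up for this example, and the absolute numbers will differ by machine and pandas version:

```python
import timeit

import pandas as pd

df = pd.DataFrame({'date': [20070530] * 10})


def per_value():
    # Parse each value separately through apply.
    return df['date'].apply(lambda x: pd.to_datetime(str(x), format='%Y%m%d'))


def whole_series():
    # Cast once to strings, then parse the entire Series in one call.
    return pd.to_datetime(df['date'].astype(str), format='%Y%m%d')


# Both approaches should produce the same dates.
assert (per_value() == whole_series()).all()

# Total seconds for 100 runs of each approach.
print('apply:     ', timeit.timeit(per_value, number=100))
print('vectorized:', timeit.timeit(whole_series, number=100))
```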
You don’t need to cast to strings: pd.to_datetime() can parse int, float, str, datetime, list, tuple, 1-d array, Series, DataFrame/dict-like, so calling it directly with the specific format= should work.
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
One useful parameter is errors=. By setting it to 'coerce', you can get NaT values for "broken" dates instead of having an error raised.
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d', errors='coerce')
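To make the effect of errors='coerce' concrete, here is a small sketch. The value 20070231 (31 February) is an invented invalid date added purely for illustration, and the astype(str) cast from the first answer is kept only to be conservative across pandas versions:

```python
import pandas as pd

# 20070231 (31 February) is a deliberately invalid date, added for illustration.
s = pd.Series([20070620, 20070231, 20070925])

# Invalid entries become NaT instead of raising an error.
parsed = pd.to_datetime(s.astype(str), format='%Y%m%d', errors='coerce')

print(parsed.isna().tolist())  # [False, True, False]
```

Checking parsed.isna() afterwards is one way to find the rows that failed to parse.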