Convert a column of datetimes to epoch in Python
Question:
I’m currently having an issue with Python. I have a Pandas DataFrame and one of the columns is a string with a date.
The format is "%Y-%m-%d %H:%M:00.000". For example: "2011-04-24 01:30:00.000".
I need to convert the entire column to integers. I tried to run this code, but it is extremely slow and I have a few million rows.
for i in range(calls.shape[0]):
    calls['dateint'][i] = int(time.mktime(time.strptime(calls.DATE[i], "%Y-%m-%d %H:%M:00.000")))
Do you guys know how to convert the whole column to epoch time?
Answers:
Convert the string to a datetime using to_datetime, then subtract the datetime 1970-1-1 and call dt.total_seconds():
In [2]:
import pandas as pd
import datetime as dt
df = pd.DataFrame({'date':['2011-04-24 01:30:00.000']})
df
Out[2]:
date
0 2011-04-24 01:30:00.000
In [3]:
df['date'] = pd.to_datetime(df['date'])
df
Out[3]:
date
0 2011-04-24 01:30:00
In [6]:
(df['date'] - dt.datetime(1970,1,1)).dt.total_seconds()
Out[6]:
0 1303608600
Name: date, dtype: float64
You can see that converting this value back yields the same time:
In [8]:
pd.to_datetime(1303608600, unit='s')
Out[8]:
Timestamp('2011-04-24 01:30:00')
So you can either add a new column or overwrite:
In [9]:
df['epoch'] = (df['date'] - dt.datetime(1970,1,1)).dt.total_seconds()
df
Out[9]:
date epoch
0 2011-04-24 01:30:00 1303608600
EDIT
A better (faster) method, as suggested by @Jeff:
In [3]:
df['date'].astype('int64')//1e9
Out[3]:
0 1303608600
Name: date, dtype: float64
In [4]:
%timeit (df['date'] - dt.datetime(1970,1,1)).dt.total_seconds()
%timeit df['date'].astype('int64')//1e9
100 loops, best of 3: 1.72 ms per loop
1000 loops, best of 3: 275 µs per loop
As the timings show, the astype approach is significantly faster.
From the Pandas documentation on working with time series data:
We subtract the epoch (midnight at January 1, 1970 UTC) and then floor divide by the “unit” (1 ms).
# generate some timestamps
stamps = pd.date_range('2012-10-08 18:15:05', periods=4, freq='D')
# convert it to milliseconds from epoch
(stamps - pd.Timestamp("1970-01-01")) // pd.Timedelta('1ms')
This will give the epoch time in milliseconds.
I know this is old but I believe the correct (and cleanest) way is the single line below:
calls['DATE'].apply(lambda x: x.timestamp())
This assumes calls['DATE'] is of dtype datetime64[ns]. If not, convert it first with:
pd.to_datetime(calls['DATE'], format="%Y-%m-%d %H:%M:00.000")
Explanation
To get the epoch value (in seconds) of a pd.Timestamp, use:
pd.Timestamp('20200101').timestamp()
This should give you 1577836800.0. You can cast it to an int if you want; the reason it is a float is that any subsecond time ends up in the decimal part.
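To see the sub-second behaviour, compare a whole-second timestamp with one carrying a fractional second (the values shown assume the naive timestamps are treated as UTC):

```python
import pandas as pd

# Whole seconds give an integer-valued float
print(pd.Timestamp('2020-01-01').timestamp())               # 1577836800.0
# Any sub-second component lands in the decimal part
print(pd.Timestamp('2020-01-01 00:00:00.250').timestamp())  # 1577836800.25
```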
You can also get the raw epoch value (in nanoseconds):
pd.Timestamp('20200101').value
This gives 1577836800000000000, which is the epoch value of the date above. The .value attribute is the number of nanoseconds since the epoch, so divide by 1e6 to get milliseconds, or by 1e9 to get epoch seconds as in the first call.
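A quick check of the relationship between .value and .timestamp(); using integer division by powers of 10 keeps the results as exact integers:

```python
import pandas as pd

ts = pd.Timestamp('2020-01-01')
print(ts.value)            # 1577836800000000000 nanoseconds since the epoch
print(ts.value // 10**9)   # 1577836800 seconds
print(ts.value // 10**6)   # 1577836800000 milliseconds
```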
To expand on the answer by s5s, the code can be further generalised to cater for missing data (represented by pd.NaT, for example). Tested on Pandas 1.2.4; it won't work on Pandas < 1.0.
calls['DATE'].apply(lambda x: x.timestamp() if not pd.isna(x) else pd.NA).astype('Int64')
Some comments:
- pd.isna() will catch pd.NaT
- The lambda expression translates pd.NaT to pd.NA, which is the newer representation of missing data
- The output of the lambda is a mix of integers and pd.NA, so we need a Pandas extension dtype such as Int64 to hold it
Sample output:
0 <NA>
1 <NA>
2 <NA>
3 <NA>
4 <NA>
...
865 1619136000
866 1619136000
...
Name: DATE, Length: 870, dtype: Int64
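For reference, a minimal self-contained version of the same idea (the column data here is made up for illustration):

```python
import pandas as pd

# One valid date and one missing value; to_datetime turns None into NaT
s = pd.Series(pd.to_datetime(['2021-04-23', None]))
epochs = s.apply(lambda x: x.timestamp() if not pd.isna(x) else pd.NA).astype('Int64')
print(epochs)
# 0    1619136000
# 1          <NA>
# dtype: Int64
```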
Another way is, after subtracting the Unix epoch, to convert the dtype to 'timedelta64[s]' (note the [s]) to specify that you want the difference in seconds, or 'timedelta64[ms]' for milliseconds, etc.
df['epoch'] = df['date'].sub(pd.Timestamp('1970-01-01')).astype('timedelta64[s]')
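Be aware that this astype behaviour changed in pandas 2.x, where converting to 'timedelta64[s]' keeps a timedelta dtype rather than producing numbers. A version-robust alternative, sketched here, is floor division by a Timedelta:

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2011-04-24 01:30:00'])})
# timedelta // Timedelta yields plain int64 counts of the chosen unit
df['epoch'] = (df['date'] - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s')
print(df['epoch'].iloc[0])  # 1303608600
```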
As of writing these lines, you can do that very easily with pandas (tested with version 1.5.2). Here is a working example with a DataFrame filled with strings representing timestamps.
df = pd.DataFrame(data=["2022-08-01T22:45:12", "2022-08-01T22:46:12", "2022-08-01T22:47:12"], columns=["time"])
df['time'].apply(lambda x: pd.Timestamp(x).timestamp())
Note that timestamp() returns the POSIX timestamp as a float. If your timestamps carry no sub-second component, you can cast the result to an integer:
df['time'].apply(lambda x: int(pd.Timestamp(x).timestamp()))