How to convert string to datetime with nulls – python, pandas?

Question:

I have a series with some datetimes (as strings) and some nulls as ‘nan’:

import pandas as pd, numpy as np, datetime as dt
df = pd.DataFrame({'Date':['2014-10-20 10:44:31', '2014-10-23 09:33:46', 'nan', '2014-10-01 09:38:45']})

I’m trying to convert these to datetime:

df['Date'] = df['Date'].apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))

but I get the error:

time data 'nan' does not match format '%Y-%m-%d %H:%M:%S'

So I try to turn these into actual nulls:

df.loc[df['Date'] == 'nan', 'Date'] = np.NaN

and repeat:

df['Date'] = df['Date'].apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))

but then I get the error:

must be string, not float

What is the quickest way to solve this problem?

Asked By: Colin O'Brien


Answers:

Just use to_datetime and set errors='coerce' to handle duff data:

In [321]:

df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df
Out[321]:
                 Date
0 2014-10-20 10:44:31
1 2014-10-23 09:33:46
2                 NaT
3 2014-10-01 09:38:45

In [322]:

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 1 columns):
Date    3 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 64.0 bytes

The problem with calling strptime directly is that it raises an error whenever the string or the dtype is incorrect.

If you did this then it would work:

In [324]:

def func(x):
    try:
        return dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
    except (ValueError, TypeError):  # bad string, or a non-string such as NaN
        return pd.NaT

df['Date'].apply(func)
Out[324]:
0   2014-10-20 10:44:31
1   2014-10-23 09:33:46
2                   NaT
3   2014-10-01 09:38:45
Name: Date, dtype: datetime64[ns]

but it is faster to use the built-in to_datetime than to call apply, which essentially just loops over your series.

timings

In [326]:

%timeit pd.to_datetime(df['Date'], errors='coerce')
%timeit df['Date'].apply(func)
10000 loops, best of 3: 65.8 µs per loop
10000 loops, best of 3: 186 µs per loop

We see here that to_datetime is nearly 3x faster.
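As a side note, when every valid string shares one known format, passing format= to to_datetime explicitly usually speeds parsing further, because pandas can skip format inference. A sketch using the question's data (timings vary by pandas version):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2014-10-20 10:44:31', '2014-10-23 09:33:46',
                            'nan', '2014-10-01 09:38:45']})

# An explicit format avoids per-string format inference; strings that
# don't match it still become NaT thanks to errors='coerce'.
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d %H:%M:%S',
                            errors='coerce')
print(df['Date'].dtype)  # datetime64[ns], with NaT for the 'nan' row
```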

Answered By: EdChum

I find letting pandas do the work to be too slow on large dataframes. In another post I learned of a technique that speeds this up dramatically when the number of unique values is much smaller than the number of rows. (My data is usually stock price or trade blotter data.) It first builds a dict that maps the text dates to their datetime objects, then applies the dict to convert the column of text dates.

import pandas as pd, datetime as dt

def str2time(val):
    try:
        return dt.datetime.strptime(val, '%H:%M:%S.%f')
    except (ValueError, TypeError):  # bad string, or a non-string such as NaN
        return pd.NaT

def TextTime2Time(s):
    times = {t : str2time(t) for t in s.unique()}
    return s.apply(lambda v: times[v])

df.date = TextTime2Time(df.date)
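The same parse-each-unique-value-once idea can lean on to_datetime instead of strptime. A sketch, assuming date strings like the question's (the function name here is illustrative; recent pandas versions apply a similar caching trick internally):

```python
import pandas as pd

def text_to_time_mapped(s):
    """Parse each distinct string once, then map the results back.

    Pays off when the series has far fewer unique values than rows.
    """
    uniques = s.unique()
    # errors='coerce' turns unparseable entries (e.g. 'nan') into NaT
    parsed = pd.to_datetime(pd.Series(list(uniques)), errors='coerce')
    return s.map(dict(zip(uniques, parsed)))

s = pd.Series(['2014-10-20 10:44:31', 'nan', '2014-10-20 10:44:31'])
print(text_to_time_mapped(s))
```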
Answered By: jdmarino