Pandas: slow date conversion

Question:

I’m reading a huge CSV with a date field in the format YYYYMMDD and I’m using the following lambda to convert it when reading:

import pandas as pd

df = pd.read_csv(filen,
                 index_col=None,
                 header=None,
                 parse_dates=[0],
                 date_parser=lambda t: pd.to_datetime(str(t),
                                                      format='%Y%m%d',
                                                      errors='coerce'))

This function is very slow though.

Any suggestion to improve it?

Asked By: ppaulojr


Answers:

Try the standard library:

import datetime
parser = lambda t: datetime.datetime.strptime(str(t), "%Y%m%d")

However, I don’t really know if this is much faster than pandas.

Since your format is so simple, what about

def parse(t):
    string_ = str(t)
    return datetime.date(int(string_[:4]), int(string_[4:6]), int(string_[6:]))

EDIT: you say you need to take care of invalid data.

def parse(t):
    string_ = str(t)
    try:
        return datetime.date(int(string_[:4]), int(string_[4:6]), int(string_[6:]))
    except ValueError:
        return default_datetime  # you should define that somewhere else

All in all, I’m a bit conflicted about the validity of your problem:

  • you need to be fast, but you still get your data from a CSV
  • you need to be fast, but you still need to deal with invalid data

That’s somewhat contradictory. My personal approach would be to assume that your “huge” CSV only needs to be brought into a better-performing format once. Either you shouldn’t care about the speed of that one-off conversion (because it only happens once), or you should get whatever produces the CSV to give you better data; there are many formats that don’t rely on string parsing.
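A minimal sketch of that one-off conversion: parse the dates once, then persist the frame in a binary format so later loads skip string parsing entirely. The in-memory sample and the output path "dates.pkl" are assumptions for illustration.

```python
import io
import pandas as pd

# io.StringIO stands in for the real CSV file; "dates.pkl" is a
# hypothetical output path for the converted data.
csv_data = io.StringIO("20120608,12321\n20130608,12321\n")
df = pd.read_csv(csv_data, header=None, names=["date", "val"])
df["date"] = pd.to_datetime(df["date"].astype(str), format="%Y%m%d")
df.to_pickle("dates.pkl")          # or df.to_parquet(...) with pyarrow installed
df2 = pd.read_pickle("dates.pkl")  # fast reload, dtypes preserved
```

Subsequent loads then pay no string-parsing cost at all, since the datetime dtype is stored directly.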

Answered By: Marcus Müller

Note: as @ritchie46’s answer states, this solution may be redundant since pandas version 0.25, thanks to the new cache_dates argument, which defaults to True.

Try using this function for parsing dates:

def lookup(date_pd_series, format=None):
    """
    This is an extremely fast approach to datetime parsing.
    For large data, the same dates are often repeated. Rather than
    re-parse these, we store all unique dates, parse them, and
    use a lookup to convert all dates.
    """
    dates = {date:pd.to_datetime(date, format=format) for date in date_pd_series.unique()}
    return date_pd_series.map(dates)

Use it like:

df['date-column'] = lookup(df['date-column'], format='%Y%m%d')

Benchmarks:

$ python date-parse.py
to_datetime: 5799 ms
dateutil:    5162 ms
strptime:    1651 ms
manual:       242 ms
lookup:        32 ms

Source: https://github.com/sanand0/benchmarks/tree/master/date-parse

Answered By: fixxxer

No need to specify a date_parser: pandas is able to parse this without any trouble, and it will be much faster:

In [21]:

import io
import pandas as pd
t="""date,val
20120608,12321
20130608,12321
20140308,12321"""
df = pd.read_csv(io.StringIO(t), parse_dates=[0])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 2 columns):
date    3 non-null datetime64[ns]
val     3 non-null int64
dtypes: datetime64[ns](1), int64(1)
memory usage: 72.0 bytes
In [22]:

df
Out[22]:
        date    val
0 2012-06-08  12321
1 2013-06-08  12321
2 2014-03-08  12321
Answered By: EdChum

Great suggestion! As @EdChum suggests, letting pandas handle the parsing with infer_datetime_format=True can be significantly faster. Below is my example.

I have a file of temperature data from a sensor log, which looks like this:

RecNum,Date,LocationID,Unused
1,11/7/2013 20:53:01,13.60,"117","1",
2,11/7/2013 21:08:01,13.60,"117","1",
3,11/7/2013 21:23:01,13.60,"117","1",
4,11/7/2013 21:38:01,13.60,"117","1",
...

My code reads the csv and parses the date (parse_dates=['Date']).
With infer_datetime_format=False, it takes 8min 8sec:

Tue Jan 24 12:18:27 2017 - Loading the Temperature data file.
Tue Jan 24 12:18:27 2017 - Temperature file is 88.172 MB.
Tue Jan 24 12:18:27 2017 - Loading into memory. Please be patient.
Tue Jan 24 12:26:35 2017 - Success: loaded 2,169,903 records.

With infer_datetime_format=True, it takes 13sec:

Tue Jan 24 13:19:58 2017 - Loading the Temperature data file.
Tue Jan 24 13:19:58 2017 - Temperature file is 88.172 MB.
Tue Jan 24 13:19:58 2017 - Loading into memory. Please be patient.
Tue Jan 24 13:20:11 2017 - Success: loaded 2,169,903 records.
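The call described above might look like the sketch below, run here on a tiny in-memory sample (the real file and the timing wrapper are omitted). Note that on pandas < 2.0 you would pass infer_datetime_format=True; since pandas 2.0 format inference is the default behavior and that flag is deprecated.

```python
import io
import pandas as pd

# Tiny stand-in for the sensor log; pandas infers the datetime format
# from the first value and reuses it for the rest of the column.
sample = io.StringIO(
    "RecNum,Date,Temp\n"
    "1,11/7/2013 20:53:01,13.60\n"
    "2,11/7/2013 21:08:01,13.60\n"
)
df = pd.read_csv(sample, parse_dates=["Date"])
```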
Answered By: Sam Davey

Streamlined date parsing with caching

Reading all the data and then converting it will always be slower than converting while reading the CSV: done right away, you don’t need to iterate over the data twice, and you don’t have to store the raw strings in memory.

We can define our own date parser that utilizes a cache for the dates it has already seen.

import pandas as pd

cache = {}

def cached_date_parser(s):
    if s in cache:
        return cache[s]
    dt = pd.to_datetime(s, format='%Y%m%d', errors='coerce')
    cache[s] = dt
    return dt
    
df = pd.read_csv(filen,
                 index_col=None,
                 header=None,
                 parse_dates=[0],
                 date_parser=cached_date_parser)

This has the same advantage as @fixxxer’s answer of only parsing each string once, with the added bonus of not having to read all the data and THEN parse it, saving you both memory and processing time.

Answered By: firelynx

If your datetime column carries a UTC timestamp and you only need part of it, convert it to a string, slice out what you need, and then apply the conversion below for much faster access.

created_at
2018-01-31 15:15:08 UTC
2018-01-31 15:16:02 UTC
2018-01-31 15:27:10 UTC
2018-02-01 07:05:55 UTC
2018-02-01 08:50:14 UTC

df["date"]=  df["created_at"].apply(lambda x: str(x)[:10])


df["date"] = pd.to_datetime(df["date"])
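The per-row lambda above can also be expressed with pandas’ vectorized .str accessor, which slices every row at once at the C level and is typically faster than apply. A sketch assuming the same hypothetical created_at column:

```python
import pandas as pd

# Hypothetical frame matching the sample above; .str[:10] keeps just
# the "YYYY-MM-DD" prefix of every timestamp string in one pass.
df = pd.DataFrame({"created_at": ["2018-01-31 15:15:08 UTC",
                                  "2018-02-01 07:05:55 UTC"]})
df["date"] = pd.to_datetime(df["created_at"].str[:10])
```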
Answered By: srikar saggurthi

I have a CSV with ~150k rows. After trying almost all the suggestions in this post, I found it 25% faster to:

  1. read the file row by row using Python 3.7’s native csv.reader
  2. convert all 4 numeric columns using float()
  3. parse the date column with datetime.datetime.fromisoformat()
  4. and, behold, finally convert the list to a DataFrame!

It baffles me how this can be faster than native pandas pd.read_csv(…)… can someone explain?
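The steps above can be sketched as follows, on an in-memory sample; the real file layout (four numeric columns plus an ISO date) is an assumption here, shrunk to two numeric columns.

```python
import csv
import io
from datetime import datetime

import pandas as pd

# io.StringIO stands in for the real ~150k-row file.
sample = io.StringIO("2019-01-02,1.0,2.0\n2019-01-03,3.0,4.0\n")
rows = []
for rec in csv.reader(sample):                       # 1. row-by-row csv.reader
    rows.append((datetime.fromisoformat(rec[0]),     # 3. parse the date column
                 float(rec[1]), float(rec[2])))      # 2. float() conversion
df = pd.DataFrame(rows, columns=["date", "a", "b"])  # 4. build the frame last
```

One plausible explanation: fromisoformat is a fast special-purpose parser with no format inference, whereas general-purpose date parsing does far more work per value.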

Answered By: arod

Since pandas version 0.25 the function pandas.read_csv accepts a cache_dates keyword argument (boolean, defaulting to True). So there is no need to write your own caching function as done in the accepted answer.
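For illustration, a minimal sketch with the flag written out explicitly (it is already the default, so passing it changes nothing); the in-memory sample is an assumption:

```python
import io
import pandas as pd

# With cache_dates=True, repeated date strings are parsed only once
# via an internal lookup, much like the manual caching answers above.
sample = io.StringIO("date,val\n20120608,1\n20120608,2\n20130608,3\n")
df = pd.read_csv(sample, parse_dates=["date"], cache_dates=True)
```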

Answered By: ritchie46