Pandas: slow date conversion
Question:
I’m reading a huge CSV with a date field in the format YYYYMMDD, and I’m using the following lambda to convert it while reading:
import pandas as pd

df = pd.read_csv(filen,
                 index_col=None,
                 header=None,
                 parse_dates=[0],
                 date_parser=lambda t: pd.to_datetime(str(t), format='%Y%m%d',
                                                      errors='coerce'))
This function is very slow, though. Any suggestions to improve it?
Answers:
Try the standard library:
import datetime
parser = lambda t: datetime.datetime.strptime(str(t), "%Y%m%d")
However, I don’t really know if this is much faster than pandas.
Since your format is so simple, what about
def parse(t):
    string_ = str(t)
    return datetime.date(int(string_[:4]), int(string_[4:6]), int(string_[6:]))
EDIT: you say you need to take care of invalid data.
def parse(t):
    string_ = str(t)
    try:
        return datetime.date(int(string_[:4]), int(string_[4:6]), int(string_[6:]))
    except ValueError:
        return default_datetime  # you should define that somewhere else
All in all, I’m a bit conflicted about the validity of your problem:
- you need to be fast, but still you get your data from a CSV
- you need to be fast, but still need to deal with invalid data
That’s somewhat contradictory. My personal approach would be to assume that your “huge” CSV only needs to be brought into a better-performing format once. Either you shouldn’t care about the speed of that conversion (because it happens only once), or you should get whatever produces the CSV to give you better data; there are plenty of formats that don’t rely on string parsing.
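To illustrate the "convert once" idea, here is a minimal sketch, with hypothetical sample data, that parses the dates a single time and persists the frame in a binary format (pickle, chosen here because it needs no extra dependencies; Parquet or Feather would work similarly with pyarrow installed). Later loads then skip string parsing entirely.

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical sample standing in for the "huge" CSV.
csv_text = "date,val\n20120608,1\n20130608,2\n20140308,3\n"

# One-off conversion: parse dates once, then persist in binary form.
df = pd.read_csv(io.StringIO(csv_text))
df["date"] = pd.to_datetime(df["date"], format="%Y%m%d")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.pkl")
    df.to_pickle(path)           # slow step, paid only once
    fast = pd.read_pickle(path)  # later loads: no date parsing at all

print(fast["date"].dtype)  # datetime64[ns]
```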
Note: As @ritchie46’s answer states, this solution may be redundant since pandas version 0.25, thanks to the new cache_dates argument, which defaults to True.
Try using this function for parsing dates:
import pandas as pd

def lookup(date_pd_series, format=None):
    """
    This is an extremely fast approach to datetime parsing.
    For large data, the same dates are often repeated. Rather than
    re-parse these, we store all unique dates, parse them, and
    use a lookup to convert all dates.
    """
    dates = {date: pd.to_datetime(date, format=format)
             for date in date_pd_series.unique()}
    return date_pd_series.map(dates)
Use it like:
df['date-column'] = lookup(df['date-column'], format='%Y%m%d')
Benchmarks:
$ python date-parse.py
to_datetime: 5799 ms
dateutil: 5162 ms
strptime: 1651 ms
manual: 242 ms
lookup: 32 ms
Source: https://github.com/sanand0/benchmarks/tree/master/date-parse
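A self-contained sanity check of the lookup approach, repeating the function from above on hypothetical sample data with many repeated YYYYMMDD strings:

```python
import pandas as pd

def lookup(date_pd_series, format=None):
    # Parse each unique date string exactly once, then map the
    # precomputed results back onto the full series.
    dates = {date: pd.to_datetime(date, format=format)
             for date in date_pd_series.unique()}
    return date_pd_series.map(dates)

# Hypothetical sample: only 3 unique dates across 4000 rows,
# so only 3 calls to pd.to_datetime are needed.
s = pd.Series(["20120608", "20130608", "20120608", "20140308"] * 1000)
parsed = lookup(s, format="%Y%m%d")
print(parsed.iloc[0])  # 2012-06-08 00:00:00
```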
No need to specify a date_parser; pandas can parse this without any trouble, and it will be much faster:
In [21]:
import io
import pandas as pd
t="""date,val
20120608,12321
20130608,12321
20140308,12321"""
df = pd.read_csv(io.StringIO(t), parse_dates=[0])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 2 columns):
date 3 non-null datetime64[ns]
val 3 non-null int64
dtypes: datetime64[ns](1), int64(1)
memory usage: 72.0 bytes
In [22]:
df
Out[22]:
date val
0 2012-06-08 12321
1 2013-06-08 12321
2 2014-03-08 12321
Great suggestion from @EdChum! Using infer_datetime_format=True can be significantly faster. Below is my example.
I have a file of temperature data from a sensor log, which looks like this:
RecNum,Date,LocationID,Unused
1,11/7/2013 20:53:01,13.60,"117","1",
2,11/7/2013 21:08:01,13.60,"117","1",
3,11/7/2013 21:23:01,13.60,"117","1",
4,11/7/2013 21:38:01,13.60,"117","1",
...
My code reads the csv and parses the date (parse_dates=['Date']).
With infer_datetime_format=False, it takes 8 min 8 sec:
Tue Jan 24 12:18:27 2017 - Loading the Temperature data file.
Tue Jan 24 12:18:27 2017 - Temperature file is 88.172 MB.
Tue Jan 24 12:18:27 2017 - Loading into memory. Please be patient.
Tue Jan 24 12:26:35 2017 - Success: loaded 2,169,903 records.
With infer_datetime_format=True, it takes 13 sec:
Tue Jan 24 13:19:58 2017 - Loading the Temperature data file.
Tue Jan 24 13:19:58 2017 - Temperature file is 88.172 MB.
Tue Jan 24 13:19:58 2017 - Loading into memory. Please be patient.
Tue Jan 24 13:20:11 2017 - Success: loaded 2,169,903 records.
Streamlined date parsing with caching
Reading all the data and then converting it will always be slower than converting while reading the CSV: if you do it right away, you don’t need to iterate over the data twice, and you don’t have to hold the raw strings in memory.
We can define our own date parser that utilizes a cache for the dates it has already seen.
import pandas as pd

cache = {}

def cached_date_parser(s):
    if s in cache:
        return cache[s]
    dt = pd.to_datetime(s, format='%Y%m%d', errors='coerce')
    cache[s] = dt
    return dt

df = pd.read_csv(filen,
                 index_col=None,
                 header=None,
                 parse_dates=[0],
                 date_parser=cached_date_parser)
This has the same advantage as @fixxxer’s answer of parsing each string only once, with the added bonus of not having to read all the data and then parse it, saving both memory and processing time.
If your datetime column carries a UTC timestamp and you only need part of it, convert it to a string, slice what you need, and then apply the code below for much faster access.
created_at
2018-01-31 15:15:08 UTC
2018-01-31 15:16:02 UTC
2018-01-31 15:27:10 UTC
2018-02-01 07:05:55 UTC
2018-02-01 08:50:14 UTC
df["date"] = df["created_at"].apply(lambda x: str(x)[:10])
df["date"] = pd.to_datetime(df["date"])
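A runnable sketch of the same idea on hypothetical sample data, using the vectorized .str accessor instead of apply, which avoids a Python-level function call per row:

```python
import pandas as pd

# Hypothetical sample mirroring the strings shown above.
created_at = pd.Series([
    "2018-01-31 15:15:08 UTC",
    "2018-01-31 15:16:02 UTC",
    "2018-02-01 07:05:55 UTC",
])

# Slice off the date portion, then parse only that fixed-width prefix.
parsed_dates = pd.to_datetime(created_at.str[:10])
print(parsed_dates.iloc[0])  # 2018-01-31 00:00:00
```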
I have a csv with ~150k rows. After trying almost all the suggestions in this post, I found it 25% faster to:
- read the file row by row using Python 3.7’s native csv.reader
- convert all 4 numeric columns using float()
- parse the date column with datetime.datetime.fromisoformat()
- and behold: finally convert the list to a DataFrame (!)
It baffles me how this can be faster than native pandas pd.read_csv(…). Can someone explain?
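The steps above can be sketched as follows. This is a rough, self-contained reconstruction with a hypothetical column layout (one ISO-format date column followed by four numeric columns), not the original poster’s exact code:

```python
import csv
import datetime
import io

import pandas as pd

# Hypothetical sample: one ISO date column and four numeric columns.
raw = """date,a,b,c,d
2020-01-01,1.0,2.0,3.0,4.0
2020-01-02,5.0,6.0,7.0,8.0
"""

rows = []
reader = csv.reader(io.StringIO(raw))
header = next(reader)
for rec in reader:
    # Parse the date and the numeric fields row by row.
    rows.append([datetime.datetime.fromisoformat(rec[0])]
                + [float(x) for x in rec[1:]])

# Convert the accumulated list to a DataFrame at the very end.
df = pd.DataFrame(rows, columns=header)
print(df.dtypes["date"])  # datetime64[ns]
```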
Since pandas version 0.25 the function pandas.read_csv accepts a cache_dates=boolean keyword argument (which defaults to True), so there is no need to write your own function for caching as done in the accepted answer.
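A minimal demonstration, with hypothetical inline data containing repeated dates so the built-in cache can pay off (relying on pandas parsing the YYYYMMDD strings on its own, as shown in @EdChum’s answer):

```python
import io

import pandas as pd

# Repeated dates: the built-in cache parses each unique string only once.
csv_text = "date,val\n" + "20120608,1\n" * 5 + "20140308,2\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["date"],
    cache_dates=True,  # the default since pandas 0.25
)
print(df["date"].dtype)  # datetime64[ns]
```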