Cast column containing multiple string date formats to DateTime in Spark
Question:
I have a date column in my Spark DataFrame that contains multiple string formats. I would like to cast these to DateTime.
The two formats in my column are:
- mm/dd/yyyy; and
- yyyy-mm-dd
My solution so far is to use a UDF to change the first date format to match the second as follows:
import re
from datetime import datetime
from pyspark.sql.functions import udf, to_date

def parseDate(dateString):
    if re.match(r'\d{1,2}/\d{1,2}/\d{4}', dateString) is not None:
        return datetime.strptime(dateString, '%m/%d/%Y').strftime('%Y-%m-%d')
    else:
        return dateString

# Create Spark UDF based on above function
dateUdf = udf(parseDate)

df = (df.select(to_date(dateUdf(raw_transactions_df['trans_dt']))))
This works, but is not all that fault-tolerant. I am specifically concerned about:
- Date formats I am yet to encounter.
- Distinguishing between mm/dd/yyyy and dd/mm/yyyy (the regex I’m using clearly doesn’t do this at the moment).
Is there a better way to do this?
Answers:
Personally I would recommend using SQL functions directly without expensive and inefficient reformatting:
from pyspark.sql.functions import coalesce, to_date
def to_date_(col, formats=("MM/dd/yyyy", "yyyy-MM-dd")):
    # Spark 2.2 or later syntax, for < 2.2 use unix_timestamp and cast
    return coalesce(*[to_date(col, f) for f in formats])
This will choose the first format that can successfully parse the input string.
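For Spark versions before 2.2, where to_date does not take a format argument, a rough sketch of the unix_timestamp-and-cast alternative mentioned in the code comment (not part of the original answer) could look like this:
from pyspark.sql.functions import coalesce, unix_timestamp

def to_date_legacy(col, formats=("MM/dd/yyyy", "yyyy-MM-dd")):
    # unix_timestamp(col, fmt) yields null on parse failure, so coalesce
    # picks the first format that matches, same as the to_date version
    return coalesce(*[unix_timestamp(col, f).cast("timestamp").cast("date") for f in formats])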
Usage:
df = spark.createDataFrame([(1, "01/22/2010"), (2, "2018-12-01")], ("id", "dt"))
df.withColumn("pdt", to_date_("dt")).show()
+---+----------+----------+
| id| dt| pdt|
+---+----------+----------+
| 1|01/22/2010|2010-01-22|
| 2|2018-12-01|2018-12-01|
+---+----------+----------+
It will be faster than a udf, and adding new formats is just a matter of adjusting the formats parameter.
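For example, if some rows also arrived as yyyyMMdd (a hypothetical extra pattern, purely to illustrate extending the tuple), no other code would need to change:
# Hypothetical: data also contains values like "20100122"
df.withColumn("pdt", to_date_("dt", formats=("MM/dd/yyyy", "yyyy-MM-dd", "yyyyMMdd"))).show()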
However, it won’t help you with format ambiguities. In the general case it might not be possible to resolve them without manual intervention and cross-referencing with external data.
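One way to at least surface the ambiguous values for manual review is to parse with both interpretations and flag the rows where they disagree (a sketch of my own, not from the original answer):
from pyspark.sql.functions import to_date, col

# Rows where both readings parse but produce different dates need human review.
ambiguous = (
    df.withColumn("as_mdy", to_date("dt", "MM/dd/yyyy"))
      .withColumn("as_dmy", to_date("dt", "dd/MM/yyyy"))
      .where(col("as_mdy").isNotNull()
             & col("as_dmy").isNotNull()
             & (col("as_mdy") != col("as_dmy")))
)
ambiguous.show()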
The same thing can of course be done in Scala:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, to_date}
def to_date_(col: Column,
             formats: Seq[String] = Seq("MM/dd/yyyy", "yyyy-MM-dd")) = {
  coalesce(formats.map(f => to_date(col, f)): _*)
}
You can do this in 100% SQL with something like this:
create database delete_me;
use delete_me;
create table test (enc_date string);
insert into test values ('10/28/2019');
insert into test values ('2020-03-31 00:00:00.000');
insert into test values ('2019-10-18');
insert into test values ('gobledie-gook');
insert into test values ('');
insert into test values (null);
insert into test values ('NULL');
-- you might need the following line depending on your version of spark
-- set spark.sql.legacy.timeParserPolicy = LEGACY;
select enc_date, coalesce(to_date(enc_date, "yyyy-MM-dd"), to_date(enc_date, "MM/dd/yyyy")) as date from test;
enc_date                 date
-----------------------  ----------
2020-03-31 00:00:00.000  2020-03-31
2019-10-18               2019-10-18
null                     null
10/28/2019               2019-10-28
gobledie-gook            null
NULL                     null
                         null
How can we achieve this for multiple date columns?
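Reusing the to_date_ helper defined above, a minimal sketch (with hypothetical column names) would be to loop over the columns and overwrite each one in turn:
date_cols = ["order_dt", "ship_dt"]  # hypothetical column names
for c in date_cols:
    df = df.withColumn(c, to_date_(c))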