pyspark to_date convert returning null for invalid dates

Question:

I am trying to convert a string column to a date using to_date. Everything works fine, but my requirement is to fail the Spark job if there is any bad data, that is, any malformed date input. Currently, to_date returns null for such values instead of failing. How can I make sure the job fails in this scenario?

Asked By: soumya-kole

||

Answers:

The behavior of the to_date function depends on the spark.sql.ansi.enabled Spark option.
When it is disabled (the default before Spark 4.0), Spark uses a Hive-compliant dialect and returns null instead of failing.
Conversely, when it is enabled, Spark is ANSI-compliant and fails if the input is malformed, as stated here.
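
As a minimal sketch of the ANSI approach, assuming an existing SparkSession and a DataFrame with a string column: the helper name parse_dates_or_fail, the column name raw_date, and the pattern yyyy-MM-dd are placeholders of mine, not part of the question.

```python
def parse_dates_or_fail(spark, df, column="raw_date", fmt="yyyy-MM-dd"):
    """Return df with an extra parsed_date column, parsed under ANSI mode.

    `spark` is an existing SparkSession and `df` a DataFrame holding a string
    column named `column`. All names here are illustrative placeholders.
    """
    # Switch the session to ANSI mode. This changes the behavior of many
    # other expressions in the same session, not only to_date.
    spark.conf.set("spark.sql.ansi.enabled", "true")

    # Build the parse as a SQL expression; the original columns are kept.
    return df.selectExpr("*", f"to_date({column}, '{fmt}') AS parsed_date")
```

Spark evaluates lazily, so the error only surfaces once an action (count, collect, write, ...) actually runs on the returned DataFrame.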

That said, you may not want to enable spark.sql.ansi.enabled, because it has many other effects; see here.

An alternative solution is to use a UDF instead of the to_date function to perform the date parsing, and to throw an exception if the parse fails.
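
A rough sketch of the UDF approach could look like the following. The names parse_date_strict and with_strict_date are hypothetical, and the format string uses Python's strptime syntax (%Y-%m-%d) rather than Spark's pattern syntax (yyyy-MM-dd). I am assuming that null inputs should stay null rather than fail; adjust that if nulls are also bad data for you.

```python
from datetime import date, datetime
from typing import Optional


def parse_date_strict(value: Optional[str], fmt: str = "%Y-%m-%d") -> Optional[date]:
    """Parse `value` with `fmt`, raising instead of returning None on bad input."""
    if value is None:
        # Assumption: a missing value is not "malformed", so it stays null.
        return None
    try:
        return datetime.strptime(value, fmt).date()
    except ValueError as exc:
        raise ValueError(f"Malformed date {value!r}, expected format {fmt}") from exc


def with_strict_date(df, source_col: str, target_col: str, fmt: str = "%Y-%m-%d"):
    """Add `target_col` to `df` by strictly parsing `source_col` in a Python UDF."""
    # Imported here so the parser above can be unit-tested without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import DateType

    strict_udf = F.udf(lambda v: parse_date_strict(v, fmt), DateType())
    return df.withColumn(target_col, strict_udf(F.col(source_col)))
```

Usage would be something like with_strict_date(df, "raw_date", "parsed_date"). The exception is raised inside the Python worker, so the job fails when an action runs on the result. Keep in mind that a Python UDF is slower than the built-in to_date.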

Answered By: vinsce