How to create date from year, month and day in PySpark?

Question:

I have three columns for year, month and day. How can I use these to create a date column in PySpark?

Asked By: Yi Du


Answers:

You can use concat_ws() to concatenate the columns with - and cast the result to date.
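
For reproducibility, here is a minimal sketch of how the sample DataFrame below might be built (the integer column types are an assumption; concat_ws casts them to string implicitly):

#build the sample data (assumed schema)
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(2020, 12, 12)], ["year", "month", "day"])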

#sampledata
df.show()

#+----+-----+---+
#|year|month|day|
#+----+-----+---+
#|2020|   12| 12|
#+----+-----+---+
from pyspark.sql.functions import col, concat_ws, date_format, to_timestamp, to_date, from_unixtime, unix_timestamp

df.withColumn("date",concat_ws("-",col("year"),col("month"),col("day")).cast("date")).show()
#+----+-----+---+----------+
#|year|month|day|      date|
#+----+-----+---+----------+
#|2020|   12| 12|2020-12-12|
#+----+-----+---+----------+

#dynamic way
cols=["year","month","day"]
df.withColumn("date",concat_ws("-",*cols).cast("date")).show()
#+----+-----+---+----------+
#|year|month|day|      date|
#+----+-----+---+----------+
#|2020|   12| 12|2020-12-12|
#+----+-----+---+----------+

#using date_format, to_timestamp, to_date, from_unixtime(unix_timestamp) functions

df.withColumn("date",date_format(concat_ws("-",*cols),"yyyy-MM-dd").cast("date")).show()
df.withColumn("date",to_timestamp(concat_ws("-",*cols),"yyyy-MM-dd").cast("date")).show()
df.withColumn("date",to_date(concat_ws("-",*cols),"yyyy-MM-dd")).show()
df.withColumn("date",from_unixtime(unix_timestamp(concat_ws("-",*cols),"yyyy-MM-dd"),"yyyy-MM-dd").cast("date")).show()
#+----+-----+---+----------+
#|year|month|day|      date|
#+----+-----+---+----------+
#|2020|   12| 12|2020-12-12|
#+----+-----+---+----------+
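
Note that single-digit months or days produce strings like 2020-1-5; the plain cast("date") tolerates that, but an explicit "yyyy-MM-dd" pattern may yield null or a parse error under Spark 3's stricter parser. A sketch that zero-pads the components first (assuming integer columns):

#zero-pad month/day so the explicit pattern always matches
from pyspark.sql.functions import lpad
df.withColumn("date",to_date(concat_ws("-",col("year"),lpad(col("month").cast("string"),2,"0"),lpad(col("day").cast("string"),2,"0")),"yyyy-MM-dd")).show()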
Answered By: notNull

For Spark 3+, you can use the make_date function:

from pyspark.sql.functions import expr

df = df.withColumn("date", expr("make_date(year, month, day)"))
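
Newer releases (Spark 3.3+, if I'm not mistaken) also expose make_date directly in the Python API, so the expr wrapper can be dropped; a minimal sketch:

#native Python API equivalent (assumes Spark 3.3+)
from pyspark.sql.functions import make_date
df = df.withColumn("date", make_date("year", "month", "day"))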
Answered By: blackbishop

Using PySpark on Databricks, here is a solution for when you have a pure string; unix_timestamp may not work and can quietly yield wrong results, so be very cautious when using unix_timestamp or to_date in PySpark.
For example, if your string has a format like "20140625", these calls can simply generate a totally wrong version of the input dates (most likely because lowercase mm in a pattern such as "yyyymmdd" is parsed as minutes, whereas months are MM). In my case no method worked except rebuilding the string through concatenation and casting it as date, as follows.

from pyspark.sql.functions import col, lit, substring, concat

# string format to deal with: "20050627","19900401",...

#create a new column with a shorter name, keeping the original column as well
df = df.withColumn("dod", col("date_of_death"))

#build the date from the string components and cast it
df = df.withColumn("dod", concat(substring(df.dod,1,4),lit("-"),substring(df.dod,5,2),lit("-"),substring(df.dod,7,2)).cast("date"))

The results: the new dod column holds proper date values, e.g. "20050627" becomes 2005-06-27.

Beware of using the following format: it will most probably, and oddly, generate wrong results without raising or showing you any error. In my case it ruined most of my analyses:

import pyspark.sql.functions as f

### wrong use! the lowercase mm in "yyyymmdd" is parsed as minutes, not months
f.to_date(f.unix_timestamp(df.dod,"yyyymmdd").cast("timestamp"))
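
That said, with the uppercase month pattern the parse works directly on such strings; a minimal sketch, assuming the same dod string column:

#"yyyyMMdd" (MM = months) parses strings like "20140625" correctly
df = df.withColumn("dod", f.to_date(df.dod, "yyyyMMdd"))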
Answered By: ashkan