Python: How to convert Pyspark column to date type if there are null values

Question:

In pyspark, I have a dataframe that has dates that get imported as strings. There are null values in these dates-as-strings columns. I’m trying to convert these columns into date type columns, but I keep getting errors. Here’s a small example of the dataframe:

+--------+----------+----------+
|DeviceId|   Created| EventDate|
+--------+----------+----------+
|       1|      null|2017-03-09|
|       1|      null|2017-03-09|
|       1|2017-03-09|2017-03-09|
|       1|2017-03-15|2017-03-15|
|       1|      null|2017-05-06|
|       1|2017-05-06|2017-05-06|
|       1|      null|      null|
+--------+----------+----------+

When there are no null values, I have found that the code below works to convert the data types:

from datetime import datetime
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DateType

dt_func = udf(lambda x: datetime.strptime(x, '%Y-%m-%d'), DateType())
df = df.withColumn('Created', dt_func(col('Created')))

Once I add null values, it crashes. I’ve tried to modify the UDF to account for nulls as follows:

import numpy as np
def convertDatetime(x):
    return sf.when(x.isNull(), 'null').otherwise(datetime.strptime(x, '%Y-%m-%d'))
dt_func =  udf(convertDatetime, DateType())

I also tried filling the nulls with an arbitrary date-string, converting the columns to dates, and then replacing the arbitrary fill date with nulls, as below:

def dt_conv(df, cols, form = '%Y-%m-%d', temp_plug = '1900-01-01'):
    df = df.na.fill(temp_plug)
    dt_func =  udf (lambda x: datetime.strptime(x, form), DateType())

    for col_ in cols:
        df = df.withColumn(col_, dt_func(col(col_)))
    df = df.replace(datetime.strptime(temp_plug, form), 'null')
    return df

However, this method gives me this error:

ValueError: to_replace should be a float, int, long, string, list, tuple, or dict

Can someone help me figure this out?

Asked By: Jed


Answers:

try this –

# Some data, I added empty strings and nulls both
data = [(1,'','2017-03-09'),(1,None,'2017-03-09'),(1,'2017-03-09','2017-03-09')]

df = spark.createDataFrame(data).toDF('id','Created','EventDate')
df.show()

Output:

+---+----------+----------+
| id|   Created| EventDate|
+---+----------+----------+
|  1|          |2017-03-09|
|  1|      null|2017-03-09|
|  1|2017-03-09|2017-03-09|
+---+----------+----------+

Now convert the columns, handling both nulls and empty strings:

from pyspark.sql.functions import when, unix_timestamp

(df
 .withColumn('Created-formatted',
             when(df.Created.isNull() | (df.Created == ''), '0')
             .otherwise(unix_timestamp(df.Created, 'yyyy-MM-dd')))
 .withColumn('EventDate-formatted',
             when(df.EventDate.isNull() | (df.EventDate == ''), '0')
             .otherwise(unix_timestamp(df.EventDate, 'yyyy-MM-dd')))
 .drop('Created', 'EventDate')
 .show())

Output:

+---+-----------------+-------------------+
| id|Created-formatted|EventDate-formatted|
+---+-----------------+-------------------+
|  1|                0|         1489035600|
|  1|                0|         1489035600|
|  1|       1489035600|         1489035600|
+---+-----------------+-------------------+

I used unix_timestamp, which returns a BigInt (epoch seconds), but you can format those columns however you like.
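
For example, if you want real date columns rather than epoch seconds, one rough sketch (assuming you assign the chained result above to a variable, here called df_formatted, instead of just calling .show()) is to run the values back through from_unixtime and cast to date:

from pyspark.sql.functions import from_unixtime, col

# Rough sketch: df_formatted is assumed to be the DataFrame built above
# (the withColumn/drop chain assigned to a variable instead of ending in .show()).
# Rows that were filled with '0' come back as 1970-01-01 rather than null.
df_dates = (df_formatted
            .withColumn('Created-formatted',
                        from_unixtime(col('Created-formatted')).cast('date'))
            .withColumn('EventDate-formatted',
                        from_unixtime(col('EventDate-formatted')).cast('date')))
df_dates.printSchema()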

Answered By: Pushkr

Try this… just casting the column!

from pyspark.sql.functions import to_date, col

df_new = (df
          .select(to_date(col("EventDate"), "yyyy-MM-dd")
                  .alias("EventDate-formatted")))
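
Note that to_date simply returns null for null input, so it also handles the null rows from the question without any special casing. A quick sketch converting both of the original columns in place (df being the question's DataFrame):

from pyspark.sql.functions import to_date, col

# Sketch: convert both string columns in place; null strings stay null after to_date.
df_converted = (df
                .withColumn("Created", to_date(col("Created"), "yyyy-MM-dd"))
                .withColumn("EventDate", to_date(col("EventDate"), "yyyy-MM-dd")))
df_converted.printSchema()   # Created and EventDate are now DateType
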
Answered By: TBluhm