Python: How to convert Pyspark column to date type if there are null values
Question:
In pyspark, I have a dataframe that has dates that get imported as strings. There are null values in these dates-as-strings columns. I’m trying to convert these columns into date type columns, but I keep getting errors. Here’s a small example of the dataframe:
+--------+----------+----------+
|DeviceId| Created| EventDate|
+--------+----------+----------+
| 1| null|2017-03-09|
| 1| null|2017-03-09|
| 1|2017-03-09|2017-03-09|
| 1|2017-03-15|2017-03-15|
| 1| null|2017-05-06|
| 1|2017-05-06|2017-05-06|
| 1| null| null|
+--------+----------+----------+
When there are no null values, I have found that this code below will work to convert the data types:
from datetime import datetime
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DateType

dt_func = udf(lambda x: datetime.strptime(x, '%Y-%m-%d'), DateType())
df = df.withColumn('Created', dt_func(col('Created')))
Once I add null values it crashes. I’ve tried to modify the udf to account for nulls as follows:
import pyspark.sql.functions as sf

def convertDatetime(x):
    return sf.when(x.isNull(), 'null').otherwise(datetime.strptime(x, '%Y-%m-%d'))

dt_func = udf(convertDatetime, DateType())
I also tried filling the nulls with an arbitrary date-string, converting the columns to dates, and then trying to replace the arbitrary fill date with nulls as below:
def dt_conv(df, cols, form='%Y-%m-%d', temp_plug='1900-01-01'):
    df = df.na.fill(temp_plug)
    dt_func = udf(lambda x: datetime.strptime(x, form), DateType())
    for col_ in cols:
        df = df.withColumn(col_, dt_func(col(col_)))
    df = df.replace(datetime.strptime(temp_plug, form), 'null')
    return df
However, this method gives me this error:
ValueError: to_replace should be a float, int, long, string, list, tuple, or dict
Can someone help me figure this out?
Answers:
Try this:
from pyspark.sql.functions import unix_timestamp, when

# Some data; I added both empty strings and nulls
data = [(1, '', '2017-03-09'), (1, None, '2017-03-09'), (1, '2017-03-09', '2017-03-09')]
df = spark.createDataFrame(data).toDF('id', 'Created', 'EventDate')
df.show()
+---+----------+----------+
| id| Created| EventDate|
+---+----------+----------+
| 1| |2017-03-09|
| 1| null|2017-03-09|
| 1|2017-03-09|2017-03-09|
+---+----------+----------+
(df
 .withColumn('Created-formatted', when(df.Created.isNull() | (df.Created == ''), '0')
             .otherwise(unix_timestamp(df.Created, 'yyyy-MM-dd')))
 .withColumn('EventDate-formatted', when(df.EventDate.isNull() | (df.EventDate == ''), '0')
             .otherwise(unix_timestamp(df.EventDate, 'yyyy-MM-dd')))
 .drop('Created', 'EventDate')
 .show())
+---+-----------------+-------------------+
| id|Created-formatted|EventDate-formatted|
+---+-----------------+-------------------+
| 1| 0| 1489035600|
| 1| 0| 1489035600|
| 1| 1489035600| 1489035600|
+---+-----------------+-------------------+
I used unix_timestamp, which returns a BigInt, but you can format those columns however you like (for example with from_unixtime, or by casting to a timestamp or date).
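For instance, here is how one of those epoch values maps back to a calendar date. This is plain Python for illustration (in Spark you would use from_unixtime or a cast instead):

```python
from datetime import datetime, timezone

# 1489035600 is the value unix_timestamp produced in the output above; as a
# UTC instant it is 2017-03-09 05:00:00, i.e. midnight 2017-03-09 in a
# session running at UTC-5.
epoch = 1489035600
print(datetime.fromtimestamp(epoch, tz=timezone.utc).date())  # 2017-03-09
```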
Try this… just cast the column! to_date returns null for null input, so no special null handling is needed:
from pyspark.sql.functions import col, to_date

df_new = df.select(
    to_date(col("EventDate"), "yyyy-MM-dd").alias("EventDate-formatted")
)
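For completeness, the question's UDF route also works once the null check is done in plain Python rather than with Column expressions like when()/isNull(), which only make sense on Columns, not on the plain values a UDF receives. A sketch (the `parse_date` name is mine):

```python
from datetime import date, datetime
from typing import Optional

def parse_date(s: Optional[str], fmt: str = '%Y-%m-%d') -> Optional[date]:
    # A Python UDF receives plain values, so check for None/'' directly;
    # returning None produces a null in the resulting DateType column.
    if not s:
        return None
    return datetime.strptime(s, fmt).date()

# Wrapped as udf(parse_date, DateType()), this handles the null rows cleanly.
print(parse_date(None))          # None
print(parse_date('2017-03-09'))  # 2017-03-09
```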