Convert a weird date string into date format in PySpark
Question:
I have a csv file with the date column. The date is coming in a strange format. Below are some examples:
- May the 9th of 2022
- September the 17th of 2022
- June the 09th of 2022
I am creating a glue job to load data into Redshift.
How do I convert these weird-looking strings into YYYY-MM-DD format using a PySpark DataFrame?
Answers:
You can use to_date and pass the source format "MMMM 'the' d'th of' yyyy".

As brought up by blackbishop, use "MMMM 'the' dd['st']['nd']['rd']['th'] 'of' yyyy" to handle all ordinal suffixes (1st, 2nd, 3rd, 4th, ...).
import pyspark.sql.functions as func  # import was missing from the original snippet

(spark.sparkContext.parallelize([('May the 9th of 2022',), ('September the 17th of 2022',)])
    .toDF(['dt_str'])
    .withColumn('dt', func.to_date('dt_str', "MMMM 'the' d'th of' yyyy"))
    .show(truncate=False))
# +--------------------------+----------+
# |dt_str |dt |
# +--------------------------+----------+
# |May the 9th of 2022 |2022-05-09|
# |September the 17th of 2022|2022-09-17|
# +--------------------------+----------+
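If the optional-section pattern proves fiddly, an alternative is to strip the ordinal suffix first and then parse a simpler format. The helper below is a pure-Python sketch of that two-step idea (parse_weird_date is a hypothetical name, not from the answer); in PySpark the same approach could be expressed as a regexp_replace followed by to_date.

```python
import re
from datetime import datetime

def parse_weird_date(s: str) -> str:
    """Sketch: normalize strings like 'May the 9th of 2022' to 'YYYY-MM-DD'."""
    # Drop the ordinal suffix (st/nd/rd/th) so a plain day number remains,
    # e.g. 'May the 9th of 2022' -> 'May the 9 of 2022'
    cleaned = re.sub(r'(\d+)(st|nd|rd|th)', r'\1', s)
    # %B = full month name, %d = day of month (accepts '9' and '09'), %Y = year
    return datetime.strptime(cleaned, '%B the %d of %Y').strftime('%Y-%m-%d')

print(parse_weird_date('May the 9th of 2022'))         # 2022-05-09
print(parse_weird_date('September the 17th of 2022'))  # 2022-09-17
print(parse_weird_date('June the 09th of 2022'))       # 2022-06-09
```

The suffix-stripping step is what makes 1st, 2nd, and 3rd work without enumerating each suffix in the date pattern.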