Split variable in PySpark
Question:
I am trying to split the UTC offset found in timestamp_value
into a new column called utc
. I tried to use a Python regex but was not able to get it to work.
Thank you for your answer!
This is what my dataframe looks like:
+--------+----------------------------+
|machine |timestamp_value |
+--------+----------------------------+
|1 |2022-01-06T07:47:37.319+0000|
|2 |2022-01-06T07:47:37.319+0000|
|3 |2022-01-06T07:47:37.319+0000|
+--------+----------------------------+
This is what it should look like:
+--------+-----------------------+-----+
|machine |timestamp_value        |utc  |
+--------+-----------------------+-----+
|1       |2022-01-06T07:47:37.319|+0000|
|2       |2022-01-06T07:47:37.319|+0000|
|3       |2022-01-06T07:47:37.319|+0000|
+--------+-----------------------+-----+
Answers:
You can do this with a regexp_extract
and a regexp_replace
, respectively. Note that + is a regex metacharacter, so it has to be escaped in the pattern:
import pyspark.sql.functions as F
(df
    .withColumn('utc', F.regexp_extract('timestamp_value', r'.*(\+.*)', 1))
    .withColumn('timestamp_value', F.regexp_replace('timestamp_value', r'\+.*', ''))
).show(truncate=False)
+-------+-----------------------+-----+
|machine|timestamp_value |utc |
+-------+-----------------------+-----+
|1 |2022-01-06T07:47:37.319|+0000|
|2 |2022-01-06T07:47:37.319|+0000|
|3 |2022-01-06T07:47:37.319|+0000|
+-------+-----------------------+-----+
To better understand what that regular expression means, have a look at this tool.
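As a quick sanity check outside of Spark, you can try the same escaped patterns with Python's re module, which handles \+ the same way here (a minimal sketch using one sample value from the dataframe above):

```python
import re

ts = "2022-01-06T07:47:37.319+0000"

# Extract the offset: '+' must be escaped, otherwise it is a quantifier
utc = re.search(r'.*(\+.*)', ts).group(1)

# Strip the offset from the timestamp
base = re.sub(r'\+.*', '', ts)

print(base, utc)  # 2022-01-06T07:47:37.319 +0000
```

Spark's regexp_extract and regexp_replace use Java regex rather than Python's, but the escaping requirement for + is the same in both engines.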