Split variable in PySpark
Question:
I am trying to split the UTC offset found in timestamp_value
into a new column called utc
. I tried to use a Python regex but was not able to get it to work.
Thank you for your answer!
This is what my dataframe looks like:
+--------+----------------------------+
|machine |timestamp_value |
+--------+----------------------------+
|1 |2022-01-06T07:47:37.319+0000|
|2 |2022-01-06T07:47:37.319+0000|
|3 |2022-01-06T07:47:37.319+0000|
+--------+----------------------------+
This is what it should look like:
+--------+-----------------------+-----+
|machine |timestamp_value        |utc  |
+--------+-----------------------+-----+
|1       |2022-01-06T07:47:37.319|+0000|
|2       |2022-01-06T07:47:37.319|+0000|
|3       |2022-01-06T07:47:37.319|+0000|
+--------+-----------------------+-----+
Answers:
You can do this with a regexp_extract
and a regexp_replace
, respectively. Note that + is a regex metacharacter, so it has to be escaped in the pattern:
import pyspark.sql.functions as F
(df
    .withColumn('utc', F.regexp_extract('timestamp_value', r'.*(\+.*)', 1))
    .withColumn('timestamp_value', F.regexp_replace('timestamp_value', r'\+.*', ''))
).show(truncate=False)
+-------+-----------------------+-----+
|machine|timestamp_value |utc |
+-------+-----------------------+-----+
|1 |2022-01-06T07:47:37.319|+0000|
|2 |2022-01-06T07:47:37.319|+0000|
|3 |2022-01-06T07:47:37.319|+0000|
+-------+-----------------------+-----+
To better understand what that regular expression means, have a look at this tool.
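As a quick sanity check outside of Spark, you can try the same escaped patterns with Python's re module, which handles \+ the same way here (a minimal sketch using one sample value from the dataframe above):

```python
import re

ts = "2022-01-06T07:47:37.319+0000"

# Extract the offset: '+' must be escaped, otherwise it is a quantifier
utc = re.search(r'.*(\+.*)', ts).group(1)

# Strip the offset from the timestamp
base = re.sub(r'\+.*', '', ts)

print(base, utc)  # 2022-01-06T07:47:37.319 +0000
```

Spark's regexp_extract and regexp_replace use Java regex rather than Python's, but the escaping requirement for + is the same in both engines.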