Remove substring and all characters before from pyspark column
Question:
I have a pyspark object column in a dataframe (df) like this:
| 'A'                   |
-------------------------
| field 1 - order - one |
| field 2 - sell        |
| order                 |
| sell                  |
I’d like to remove the first occurrence of ‘- ’ and all characters before it, using regexp_replace or any other SQL function that would work in this case, but I’m having a little trouble. Below is the desired output:
| 'A'             |
-------------------
| order - one     |
| sell            |
| order           |
| sell            |
Answers:
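This should work. The pattern also consumes any whitespace after the hyphen, so no leading space is left behind:
from pyspark.sql import functions as F

# Assumes an active SparkSession named spark (as in the pyspark shell).
df = spark.createDataFrame(
    [
        ("field 1 - order", "None"),
        ("field 2 - sell", "None"),
        ("order", "None"),
        ("sell", "None"),
    ],
    ["A", "B"],
)
df.show()

# Drop everything up to and including the first "-", plus any spaces after it.
# Rows without a "-" do not match and are left unchanged.
df = df.withColumn("A", F.regexp_replace("A", r"^[^-]*-\s*", ""))
df.show()
outputs:
+---------------+----+
|              A|   B|
+---------------+----+
|field 1 - order|None|
| field 2 - sell|None|
|          order|None|
|           sell|None|
+---------------+----+
+-----+----+
|    A|   B|
+-----+----+
|order|None|
| sell|None|
|order|None|
| sell|None|
+-----+----+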
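The regex approach can be checked outside Spark on the rows from the question, including the one with two hyphens. This is a minimal sketch using Python's re module, on the assumption that it handles this simple pattern the same way as the Java regex engine behind regexp_replace. The pattern ^[^-]*-\s* removes everything up to the first hyphen plus any whitespace after it; the helper name strip_prefix is made up for illustration.

```python
import re

# Everything up to and including the first "-", plus any whitespace after it.
PATTERN = re.compile(r"^[^-]*-\s*")


def strip_prefix(value: str) -> str:
    """Hypothetical helper: remove the text before the first '-' (if any)."""
    return PATTERN.sub("", value, count=1)


rows = ["field 1 - order - one", "field 2 - sell", "order", "sell"]
for row in rows:
    # Rows without a hyphen do not match the pattern and pass through as-is.
    print(repr(row), "->", repr(strip_prefix(row)))
```

Because [^-]* cannot cross a hyphen, only the first one is consumed, so later hyphens in the value survive.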
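Another option is to split column A on the first hyphen only and take the last element of the resulting array. The limit argument of split requires Spark 3.0 or later; without it, a value such as field 1 - order - one would be reduced to one. Code below:
from pyspark.sql import functions as F
df.withColumn('A', F.slice(F.split('A', r'\s*-\s*', 2), -1, 1)[0]).show()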
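One thing to watch with the split approach: splitting on every hyphen and keeping the last element returns only the text after the last hyphen, which differs from the desired output for the two-hyphen row. Limiting the split to a single cut keeps the remainder intact. Below is a plain-Python sketch of that idea, not Spark code; the helper name after_first_hyphen is hypothetical. In Spark 3.0 and later, split accepts a limit argument, and limit=2 there plays the role of maxsplit=1 here.

```python
import re


def after_first_hyphen(value: str) -> str:
    """Hypothetical helper: cut once at the first '-' and keep the last piece."""
    # maxsplit=1 produces at most two pieces; surrounding whitespace is
    # swallowed by the delimiter pattern.
    parts = re.split(r"\s*-\s*", value, maxsplit=1)
    # With no hyphen there is a single piece, so the value is returned as-is.
    return parts[-1]


rows = ["field 1 - order - one", "field 2 - sell", "order", "sell"]
for row in rows:
    unlimited = row.split("-")[-1]  # text after the LAST hyphen
    print(repr(row), "->", repr(after_first_hyphen(row)), "| unlimited:", repr(unlimited))
```

For the first row the unlimited split yields ' one', while the single-cut version yields 'order - one'.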