Remove a substring and all characters before it from a PySpark column

Question:

I have a string column in a PySpark dataframe (df) like this:

|      'A'              |
-------------------------
| field 1 - order - one |
| field 2 - sell        |
|     order             |
|     sell              |

I'd like to remove the first occurrence of '- ' and all characters before it, using regexp_replace or any other SQL function that would work in this case, but I'm having a little trouble. Below is the desired output:

|      'A'        |
-------------------
|   order - one   |
|     sell        |
|     order       |
|     sell        |
Asked By: chicagobeast12


Answers:

This should work:

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [
        ("field 1 - order", "None"),
        ("field 2 - sell", "None"),
        ("order", "None"),
        ("sell", "None"),
    ],
    ["A", "B"],
)
df.show()

# Strip everything up to and including the first "-", plus any whitespace after it
df = df.withColumn("A", F.regexp_replace("A", r"^[^-]*-\s*", ""))

df.show()

outputs:

+---------------+----+
|              A|   B|
+---------------+----+
|field 1 - order|None|
| field 2 - sell|None|
|          order|None|
|           sell|None|
+---------------+----+

+------+----+
|     A|   B|
+------+----+
| order|None|
|  sell|None|
| order|None|
|  sell|None|
+------+----+
Answered By: iambdot
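The anchored pattern can be sanity-checked without a Spark session using Python's re module (a quick sketch: re.sub on plain strings mirrors what regexp_replace applies per row, and the \s* in the pattern also swallows the space after the hyphen):

```python
import re

# Everything from the start of the string up to the first "-",
# plus any whitespace that follows it
pattern = r"^[^-]*-\s*"

rows = ["field 1 - order - one", "field 2 - sell", "order", "sell"]
cleaned = [re.sub(pattern, "", row) for row in rows]
print(cleaned)  # ['order - one', 'sell', 'order', 'sell']
```

Because the pattern is anchored with ^, only the first hyphen can match, so later hyphens (as in "order - one") survive; rows with no hyphen pass through unchanged.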

Another option is to split column A on the hyphen, limiting the split so that later hyphens survive, and take the last element of the resulting array (the limit argument to split requires Spark 3.0+). Code below:

from pyspark.sql.functions import element_at, split, trim

df.withColumn('A', trim(element_at(split('A', '-', 2), -1))).show()
Answered By: wwnde
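The split-and-slice idea can also be checked in plain Python (a sketch: str.split with maxsplit=1 plays the role of Spark's split(..., limit=2); without the limit, a value with two hyphens such as "field 1 - order - one" would lose its middle piece):

```python
rows = ["field 1 - order - one", "field 2 - sell", "order", "sell"]

# Split on the first hyphen only, keep the last piece, trim surrounding spaces
cleaned = [row.split("-", 1)[-1].strip() for row in rows]
print(cleaned)  # ['order - one', 'sell', 'order', 'sell']

# Without the limit, [-1] keeps only what follows the *last* hyphen:
print("field 1 - order - one".split("-")[-1].strip())  # one
```

Rows with no hyphen are unaffected, since split then returns a one-element list and [-1] is the original value.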