Extracting several regex matches in PySpark
Question:
I’m currently working on a regex that I want to run over a PySpark DataFrame column.
The regex is built to capture only one group, but can return several matches.
The problem is that PySpark’s native regex functions (regexp_extract and regexp_replace) only allow manipulating capture groups (through the $ operand).
Is there a way to natively (with a PySpark function, not a Python re.findall-based UDF) fetch the list of substrings matched by my regex (and I am not talking about the groups contained in the first match)?
I wish to do something like that:
my_regex = r'(\w+)'
# Fetch and manipulate the resulting matches, not just the capturing group
df = df.withColumn(df.col_name, regexp_replace('col_name', my_regex, '$1[0] - $2[0]'))
With $1 representing the first match as an array, and so on…
You can try the regex on the following input to see an example of the matches I wish to fetch.
2 AVENUE DES LAPINOUS
It should return 4 different matches, each with 1 group within.
Answers:
Unfortunately, there is no way to get all the matches natively in Spark. You can only specify a matched group index using idx:
func.regexp_extract('col', my_regex, idx=1)
There is an unmerged pull request for this feature, which can be found here.
TL;DR: As of now, you will need to write a UDF for this
In Spark 3.1+ this is possible using regexp_extract_all:
regexp_extract_all(str, regexp[, idx]) – Extract all strings in str that match the regexp expression and correspond to the regex group index.
from pyspark.sql import functions as F

df = spark.createDataFrame([('2 AVENUE DES LAPINOUS',)], ['col'])
df.show(truncate=False)
#+---------------------+
#|col |
#+---------------------+
#|2 AVENUE DES LAPINOUS|
#+---------------------+
df = df.withColumn('output', F.expr(r"regexp_extract_all(col, '(\w+)', 1)"))
df.show(truncate=False)
#+---------------------+--------------------------+
#|col |output |
#+---------------------+--------------------------+
#|2 AVENUE DES LAPINOUS|[2, AVENUE, DES, LAPINOUS]|
#+---------------------+--------------------------+