Can overlapping matches with the same start position be found using regex?

Question:

I am looking for a regex or a regex flag in Python/BigQuery that lets me find overlapping occurrences.

For example, I have the string 1.2.5.6.8.10.12

and I would like to extract:
[1., 1.2., 1.2.5., 1.2.5.6., ..., 1.2.5.6.8.10.12]

I tried running the Python code
re.findall(r"^(\d+(?:\.|$))+", string)
and it resulted in ['12']

Asked By: Gilgo

||

Answers:

As the regex engine walks through the string, every position it matches gets consumed. To extract substrings that share the same starting position, you need to look behind and capture matches back towards the start. Overlapping matches have to be captured inside a lookaround so that the captured parts are not consumed. Python's re does not support variable-length lookbehinds, but the PyPI regex module does.

import regex as re

res = re.findall(r"(?<=(.*\d(?:\.|$)))", s)

See this Python demo at tio.run or a Regex101 demo (captures will be in the first group).
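
To make the point about consumption concrete, here is a small sketch using only the standard re module. The sample string and variable names are made up for illustration; it only shows that a capture inside a lookahead finds overlapping matches and that re rejects a variable-length lookbehind.

```python
import re

s = "1.2.5"  # hypothetical short sample

# Plain matching consumes "1.2", so the overlapping "2.5" is never seen.
consumed = re.findall(r"\d\.\d", s)

# A capture group inside a lookahead matches without consuming anything,
# so the engine can retry from the next position.
overlapping = re.findall(r"(?=(\d\.\d))", s)

# The standard re module refuses a lookbehind of variable length.
try:
    re.compile(r"(?<=(.*\d))")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

print(consumed)       # ['1.2']
print(overlapping)    # ['1.2', '2.5']
print(lookbehind_ok)  # False
```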

The PyPI regex module even has an overlapped=True option, which removes the need to capture inside the lookbehind. Combined with (?r), another interesting flag that performs a reverse search, the same result can be achieved.

res = re.findall(r'(?r).*\d(?:\.|$)', s, overlapped=True)[::-1]

The result just needs to be reversed afterwards to get the desired order: Python demo


With the standard re module, one idea is to reverse the string and capture inside a lookahead. The needed parts are captured from the reversed string, then each list item is reversed again, and finally the entire list is reversed. I don't know if this is worth the effort, but it seems to work as well.

res = [x[::-1] for x in re.findall(r'(?=((?:\.\d|^).*))', s[::-1])][::-1]

Another Python demo at tio.run or a Regex101 demo (shows matching on the reversed string).
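
For readers who want to trace the reversal idea step by step, this is a minimal sketch with the standard library. The function name prefixes_via_reversal is hypothetical, and it assumes the input is made of dot-separated digit groups like the example in the question.

```python
import re

def prefixes_via_reversal(s):
    # Step 1: reverse the string so every wanted prefix becomes a suffix.
    rev = s[::-1]
    # Step 2: capture inside a lookahead; a suffix starts at the very
    # beginning or at a dot that is followed by a digit.
    hits = re.findall(r"(?=((?:\.\d|^).*))", rev)
    # Step 3: reverse each captured item, then reverse the list order.
    return [h[::-1] for h in hits][::-1]

result = prefixes_via_reversal("1.2.5.6.8.10.12")
print(result)
# ['1.', '1.2.', '1.2.5.', '1.2.5.6.', '1.2.5.6.8.',
#  '1.2.5.6.8.10.', '1.2.5.6.8.10.12']
```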

Answered By: bobble bubble

Use the query below (BigQuery)

select text, 
  array(
    select regexp_extract(text, r'((?:[^.]+.){' || i || '})')
    from unnest(generate_array(1, array_length(split(text, '.')))) i
  ) as extracted
from your_table               

with output

(image of the query result)
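
The logic of the query can also be traced in Python as a rough sketch. The function name prefixes_by_count is hypothetical, the pattern is copied as it appears in the query above, and Python's re stands in for BigQuery's RE2 engine, so treat this as an illustration of the idea rather than an exact equivalent.

```python
import re

def prefixes_by_count(text):
    # Number of dot-separated parts, like array_length(split(text, '.')).
    n = len(text.split("."))
    out = []
    # One extraction per i in 1..n, like unnest(generate_array(1, n)).
    for i in range(1, n + 1):
        # Match i repetitions of "non-dot characters plus one more character".
        m = re.search(r"((?:[^.]+.){" + str(i) + r"})", text)
        out.append(m.group(1) if m else None)
    return out

extracted = prefixes_by_count("1.2.5.6.8.10.12")
print(extracted)
# ['1.', '1.2.', '1.2.5.', '1.2.5.6.', '1.2.5.6.8.',
#  '1.2.5.6.8.10.', '1.2.5.6.8.10.12']
```

Note that the dot after [^.]+ is unescaped here, so it matches any character. That is what lets the last repetition match "12" without a trailing dot; with a single-character final part, that last repetition would find no match.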

Answered By: Mikhail Berlyant