How to match part of a string multiple overlapping times with regex
Question:
I need a Python regex that matches parts of a string multiple times, with the matches overlapping:
My String: aa-bbb-c-dd
I would like to have groups like this:
aa-bbb
bbb-c
c-dd
Does somebody have an idea on how to do this?
Answers:
You can use lookahead to get overlapping matches:
(?=\b([A-Za-z]+-[A-Za-z]+)\b)
Details:
(?= – start of a positive lookahead that matches a location immediately followed by:
\b – a word boundary
([A-Za-z]+-[A-Za-z]+) – Group 1: one or more ASCII letters, a hyphen (-), then one or more ASCII letters
\b – a word boundary
) – end of the lookahead.
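To see why the lookahead matters, here is a minimal sketch on the string from the question. The variable names (consuming, overlapping) are mine, chosen for illustration. It compares the bare pattern with the same pattern wrapped in (?=...):

```python
import re

text = "aa-bbb-c-dd"  # the string from the question

# Bare pattern: each match consumes its characters, so "bbb" cannot
# start a second match once it has been used as the end of the first.
consuming = re.findall(r'\b[A-Za-z]+-[A-Za-z]+\b', text)

# Same pattern inside a lookahead: the overall match is empty, so the
# scan moves on one position at a time and Group 1 captures may overlap.
overlapping = re.findall(r'(?=\b([A-Za-z]+-[A-Za-z]+)\b)', text)

print(consuming)    # ['aa-bbb', 'c-dd']
print(overlapping)  # ['aa-bbb', 'bbb-c', 'c-dd']
```

Because the pattern contains exactly one capturing group, re.findall returns the Group 1 text rather than the (empty) overall match.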
In Python, use it with re.findall:
import re

text = "aaaa-bb-ccc-dd"
# The lookahead consumes no characters, so the Group 1 captures can overlap
print(re.findall(r'(?=\b([A-Z]+-[A-Z]+)\b)', text, re.I))
# => ['aaaa-bb', 'bb-ccc', 'ccc-dd']
Note I changed [A-Za-z] to [A-Z] in the code since I made the regex matching case insensitive with the re.I option. Make sure you use the r string literal prefix, or \b will be treated as a BACKSPACE char, \x08, and not a word boundary.
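A quick way to check the raw-string point yourself; this is only a sketch, and the names with_raw and without_raw are made up for the comparison:

```python
import re

text = "aaaa-bb-ccc-dd"

# In a normal string literal, \b is the single BACKSPACE character.
print('\b' == '\x08')  # True
# In a raw string literal, it stays as a backslash followed by "b".
print(len(r'\b'))      # 2

# Raw pattern: the regex engine receives \b and treats it as a word boundary.
with_raw = re.findall(r'(?=\b([A-Z]+-[A-Z]+)\b)', text, re.I)

# Non-raw pattern: the regex engine receives literal backspace characters,
# which never occur in the text, so nothing matches.
without_raw = re.findall('(?=\b([A-Z]+-[A-Z]+)\b)', text, re.I)

print(with_raw)     # ['aaaa-bb', 'bb-ccc', 'ccc-dd']
print(without_raw)  # []
```

Python raises no error for the non-raw version, because '\b' is a valid string escape. The pattern just silently stops matching.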
Variations
(?=\b([^\W\d_]+-[^\W\d_]+)\b) – matches any Unicode letters
(?=(?<![^\W\d_])([^\W\d_]+-[^\W\d_]+)(?![^\W\d_])) – matches any Unicode letters, with any non-letter char acting as a boundary
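A small sketch of how the variations differ in practice. The two sample strings and the variable names are my own, picked only to make the differences visible; the accented letters are written as escapes so the example does not depend on how the file is encoded:

```python
import re

ascii_pat = r'(?=\b([A-Za-z]+-[A-Za-z]+)\b)'
unicode_pat = r'(?=\b([^\W\d_]+-[^\W\d_]+)\b)'
letter_bound_pat = r'(?=(?<![^\W\d_])([^\W\d_]+-[^\W\d_]+)(?![^\W\d_]))'

accented = "\u00e9t\u00e9-hiver-a\u00f1o"  # "été-hiver-año"
mixed = "x1a-b-c2"                          # letters glued to digits

# ASCII-only letters cannot match the accented words.
ascii_hits = re.findall(ascii_pat, accented)
# [^\W\d_] is "a word char that is neither a digit nor _", i.e. a letter.
unicode_hits = re.findall(unicode_pat, accented)

# \b sees no boundary between a digit and a letter, so nothing matches here...
word_bound_hits = re.findall(unicode_pat, mixed)
# ...while the lookarounds only require that the neighbours are not letters.
letter_bound_hits = re.findall(letter_bound_pat, mixed)

print(ascii_hits)         # []
print(unicode_hits)       # ['été-hiver', 'hiver-año']
print(word_bound_hits)    # []
print(letter_bound_hits)  # ['a-b', 'b-c']
```

This assumes Python 3 str patterns, where \w, \W and \d are Unicode-aware by default.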