Regex pattern skips last matches and misses content with parentheses
Question:
Say I have a string:
r'pat1=a, pat2=b, (e, e*89=f), bb, pat3=c, pat4=hi, pat10=ex'
I need to extract patterns as:
pat1=a,
pat2=b, (e, e*89=f), bb,
pat3=c,
pat4=hi,
pat10=ex
This is the pattern I tried:
re.findall(r'(pat\d*.*?)[(pat\d*)|$]', s)
which gives me:
['pat1=', 'pat2=b, ', 'pat3=c, ', 'pat1']
I am mostly interested in understanding how exactly my pattern works here and why it did not match the required strings. Also, what could be a solution?
Answers:
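The pattern that you tried, (pat\d*.*?)[(pat\d*)|$], matches pat and optional digits, then as few characters as possible until it can match one of the characters listed in the character class [(pat\d*)|$]. Inside square brackets, ( ) * | and $ lose their special meaning, so the class matches exactly one character out of ( ) p a t * | $ or a digit, and that character is consumed as part of the match.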
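To see where each match stops, here is a small sketch that runs the attempted pattern over the sample string and prints the single character the character class consumed. The variable names (s, attempt) are only for illustration.

```python
import re

s = 'pat1=a, pat2=b, (e, e*89=f), bb, pat3=c, pat4=hi, pat10=ex'

# The attempted pattern: a capture group followed by a character class.
# Inside [...] the characters ( ) * | $ are literals, not operators.
attempt = re.compile(r'(pat\d*.*?)[(pat\d*)|$]')

for m in attempt.finditer(s):
    captured = m.group(1)      # what re.findall returns for this match
    consumed = m.group(0)[-1]  # the one character matched by the class
    print(repr(captured), 'stopped at', repr(consumed))

# Prints:
# 'pat1=' stopped at 'a'
# 'pat2=b, ' stopped at '('
# 'pat3=c, ' stopped at 'p'
# 'pat1' stopped at '0'
```

The third match consumes the p of pat4, so pat4 can no longer be found. For pat10, the $ in the class is a literal dollar sign rather than an end-of-string anchor, so the engine backtracks \d* to a single digit and lets the class match the 0.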
To get your desired matches, you don’t want to match anything after .*? but instead assert, without consuming any characters, that the start of the next part with the same pat pattern follows. For the last part, you can assert the end of the string.
You could write the pattern as:
\bpat\d+=.*?(?=\s*\bpat\d+=|$)
The pattern matches:
\bpat\d+=
    A word boundary, then the word pat followed by 1+ digits and =
.*?
    Match as few characters as possible
(?=
    Positive lookahead, assert that what is to the right is
\s*\bpat\d+=
    Optional whitespace chars, then pat, 1+ digits and =
|
    Or
$
    The end of the string, for the last part
)
    Close the lookahead
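As a quick check, the suggested pattern can be run against the sample string from the question. This is a minimal sketch that assumes the input is a single line (the dot does not match newlines by default); the variable names are illustrative.

```python
import re

s = 'pat1=a, pat2=b, (e, e*89=f), bb, pat3=c, pat4=hi, pat10=ex'

# The lookahead only checks what follows; it consumes nothing, so the next
# "patN=" is still available as the start of the following match.
pattern = re.compile(r'\bpat\d+=.*?(?=\s*\bpat\d+=|$)')

parts = pattern.findall(s)  # no capture groups, so full matches are returned
for part in parts:
    print(part)

# Prints:
# pat1=a,
# pat2=b, (e, e*89=f), bb,
# pat3=c,
# pat4=hi,
# pat10=ex
```

Because \s* sits inside the lookahead, the whitespace before the next pat is left out of each match.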