Regex pattern skips last matches and misses content with parenthesis

Question:

Say I have a string:

r'pat1=a, pat2=b, (e, e*89=f), bb, pat3=c, pat4=hi, pat10=ex'

I need to extract patterns as:

pat1=a, 
pat2=b, (e, e*89=f), bb, 
pat3=c, 
pat4=hi, 
pat10=ex

This is the pattern I tried:

re.findall(r'(pat\d*.*?)[(pat\d*)|$]', s)

which gives me:

['pat1=', 'pat2=b, ', 'pat3=c, ', 'pat1']

I am more interested in knowing how exactly my pattern is working here that it did not match the required string. Also what could be the solution.

Asked By: Himanshuman

||

Answers:

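The pattern that you tried, (pat\d*.*?)[(pat\d*)|$], matches pat and optional digits, then as few characters as possible until it can match one of the characters listed in the character class [(pat\d*)|$]. Inside square brackets the parentheses, |, * and $ are literal characters rather than a group, an alternation, a quantifier or an end-of-string anchor, so the class matches a single character: (, ), p, a, t, a digit, *, | or $. That character is also consumed, which is why the following match cannot start there.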
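To see this happening, here is a small sketch that runs the original pattern over the string from the question and prints, for each match, the captured group and the single character that the character class consumed. The variable names (original, steps) are only for illustration.

```python
import re

s = 'pat1=a, pat2=b, (e, e*89=f), bb, pat3=c, pat4=hi, pat10=ex'

# The pattern from the question: a capture group followed by a character class.
original = re.compile(r'(pat\d*.*?)[(pat\d*)|$]')

# group(1) is what findall returns; the last character of the whole match
# is the one character that the character class matched and consumed.
steps = [(m.group(1), m.group(0)[-1]) for m in original.finditer(s)]

for captured, stopper in steps:
    print(repr(captured), 'stopped by', repr(stopper))
```

The third match consumes the p of pat4, so pat4 can no longer be found, and the last match is cut to pat1 because the 0 of pat10 is the only remaining character that the class can match.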

To get your desired matches, you don't want to consume anything after .*? but instead assert, without matching it, that the next part starting with the same pat pattern follows.

And for the last part, you can assert the end of the string.


You could write the pattern as:

\bpat\d+=.*?(?=\s*\bpat\d+=|$)

The pattern matches:

  • \bpat\d+= Match a word boundary, then pat followed by 1+ digits and =
  • .*? Match as few characters as possible
  • (?= Positive lookahead, assert to the right
    • \s*\bpat\d+= Match optional whitespace chars, then pat, 1+ digits and =
    • | Or
    • $ Assert the end of the string for the last part
  • ) Close the lookahead
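As a quick check, this sketch applies the lookahead pattern to the string from the question with re.findall. The variable names are illustrative only.

```python
import re

s = 'pat1=a, pat2=b, (e, e*89=f), bb, pat3=c, pat4=hi, pat10=ex'

# The lookahead checks for the next "patN=" (or the end of the string)
# without consuming it, so every part can start a fresh match.
pattern = r'\bpat\d+=.*?(?=\s*\bpat\d+=|$)'

parts = re.findall(pattern, s)
print(parts)
# ['pat1=a,', 'pat2=b, (e, e*89=f), bb,', 'pat3=c,', 'pat4=hi,', 'pat10=ex']
```

Because \s* sits inside the lookahead, the space before the next pat is not part of the match, so each part ends at the comma rather than at the comma plus space shown in the question.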


Answered By: The fourth bird