What is the best way to match optional whole words with Python regex?

Question:

I use regular expressions frequently, but usually in the same few ways. I sometimes run across a scenario where I’d like to capture strings with optional whole words in them. I’ve come up with the method below, but I suspect there’s a better way; I’m just not sure what it is. An example is a string like this:

For the purposes of this order, the sum of $5,476,958.00 is the estimated total costs of the initial unit well covered hereby as dry hole and for the purposes of this order, the sum of $12,948,821.00 is the estimated total costs of such initial unit well as a producing well

My goal is to capture both portions of the string beginning with the dollar sign $ and ending with either word dry or prod. In the example the whole word is producing, but sometimes it’s a variation of the word such as production, so prod is fine. The captured results should be:

['$5,476,958.00 is the estimated total costs of the initial unit well covered hereby as dry', '$12,948,821.00 is the estimated total costs of such initial unit well as a prod']

which I get with this not so elegant expression:
[val[0] for val in re.findall(r'(\$[0-9,.]+[a-z ,]+total cost.*?(dry|prod)+)', line, flags=re.IGNORECASE)]

Is there a better, more correct, way to accomplish it than this?

Asked By: reb


Answers:

We can use re.findall here:

import re

inp = "For the purposes of this order, the sum of $5,476,958.00 is the estimated total costs of the initial unit well covered hereby as dry hole and for the purposes of this order, the sum of $12,948,821.00 is the estimated total costs of such initial unit well as a producing well"
# Non-capturing groups (?:...) make findall return the whole match as a string
matches = re.findall(r'\$\d{1,3}(?:,\d{3})*(?:\.\d+)?.*?\b(?:dry|prod)', inp)
print(matches)

This prints:

['$5,476,958.00 is the estimated total costs of the initial unit well covered hereby as dry',
 '$12,948,821.00 is the estimated total costs of such initial unit well as a prod']

Here is an explanation of the regex pattern being used:

  • \$ match currency symbol $
  • \d{1,3} match 1 to 3 digits
  • (?:,\d{3})* followed by optional thousands terms
  • (?:\.\d+)? followed by optional decimal component
  • .*? lazily match all content until reaching the nearest
  • \b(?:dry|prod) match dry or prod at the start of a word
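
The same pattern can also be written in verbose mode, so that each piece explained above sits on its own commented line. This is only a sketch: the function name find_cost_phrases and the shortened sample string are made up for illustration, and re.IGNORECASE is added on the assumption that the input may contain "Dry" or "Production" in mixed case, as the flag in the question suggests.

```python
import re

# Verbose-mode version of the pattern; whitespace and # comments inside
# the pattern are ignored by the regex engine.
PATTERN = re.compile(
    r"""
    \$\d{1,3}          # dollar sign, then 1 to 3 digits
    (?:,\d{3})*        # optional thousands groups
    (?:\.\d+)?         # optional decimal part
    .*?                # lazily consume as little text as possible
    \b(?:dry|prod)     # stop at a word starting with dry or prod
    """,
    re.VERBOSE | re.IGNORECASE,
)


def find_cost_phrases(text):
    """Return every '$amount ... dry/prod' span found in text (hypothetical helper)."""
    # All groups are non-capturing, so findall returns whole matches as strings.
    return PATTERN.findall(text)


# Shortened, mixed-case sample made up for this example.
sample = (
    "the sum of $5,476,958.00 is the estimated total costs as Dry hole and "
    "the sum of $12,948,821.00 is the estimated total costs as a Production well"
)
print(find_cost_phrases(sample))
```

Because every group is non-capturing, there is no need to index into tuples as in the question's list comprehension. Note that . does not match newlines by default, so if an amount and its dry/prod keyword can sit on different lines, re.DOTALL would have to be added to the flags.
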
Answered By: Tim Biegeleisen