What is the best way to match optional whole words with Python regex?
Question:
I use regular expressions frequently, but usually in the same few ways. I sometimes run across a scenario where I’d like to capture strings with optional whole words in them. I’ve come up with the method below, but I suspect there’s a better way; I’m just not sure what it is. An example is a string like this:
For the purposes of this order, the sum of $5,476,958.00 is the estimated total costs of the initial unit well covered hereby as dry hole and for the purposes of this order, the sum of $12,948,821.00 is the estimated total costs of such initial unit well as a producing well
My goal is to capture both portions of the string beginning with the dollar sign $ and ending with either word dry or prod. In the example the whole word is producing, but sometimes it’s a variation of the word such as production, so prod is fine. The captured results should be:
['$5,476,958.00 is the estimated total costs of the initial unit well covered hereby as dry', '$12,948,821.00 is the estimated total costs of such initial unit well as a prod']
which I get with this not so elegant expression:
[val[0] for val in re.findall(r'(\$[0-9,.]+[a-z ,]+total cost.*?(dry|prod)+)', line, flags=re.IGNORECASE)]
Is there a better, more correct, way to accomplish it than this?
Answers:
We can use re.findall here:
import re
inp = "For the purposes of this order, the sum of $5,476,958.00 is the estimated total costs of the initial unit well covered hereby as dry hole and for the purposes of this order, the sum of $12,948,821.00 is the estimated total costs of such initial unit well as a producing well"
matches = re.findall(r'\$\d{1,3}(?:,\d{3})*(?:\.\d+)?.*?\b(?:dry|prod)', inp)
print(matches)
This prints:
['$5,476,958.00 is the estimated total costs of the initial unit well covered hereby as dry',
'$12,948,821.00 is the estimated total costs of such initial unit well as a prod']
Here is an explanation of the regex pattern being used:
\$                match the currency symbol $
\d{1,3}           match 1 to 3 digits
(?:,\d{3})*       followed by optional thousands terms
(?:\.\d+)?        followed by an optional decimal component
.*?               match all content until reaching the nearest
\b(?:dry|prod)    dry or prod at the start of a word (the \b word boundary stops it from matching in the middle of a word)
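To make the pieces above easier to read and maintain, the same pattern can be written with re.VERBOSE, one commented piece per line. This is only a sketch: the function name find_amount_phrases is hypothetical, and re.IGNORECASE is carried over from the question's attempt rather than from the answer. Because every group is non-capturing, re.findall returns the full matched strings, so no val[0] unpacking is needed.

```python
import re

# The pattern from the answer, laid out in verbose mode.
# Whitespace inside a verbose pattern is ignored, which is safe here
# because the pattern contains no literal spaces.
AMOUNT_TO_KEYWORD = re.compile(
    r"""
    \$\d{1,3}        # dollar sign and the leading 1 to 3 digits
    (?:,\d{3})*      # optional thousands groups such as ,476
    (?:\.\d+)?       # optional decimal component such as .00
    .*?              # lazily consume text up to the nearest keyword
    \b(?:dry|prod)   # dry or prod, but only at the start of a word
    """,
    re.VERBOSE | re.IGNORECASE,  # IGNORECASE is an assumption taken from the question
)


def find_amount_phrases(text):
    """Return every span running from a dollar amount to the nearest dry/prod."""
    # All groups are non-capturing, so findall yields whole-match strings.
    return AMOUNT_TO_KEYWORD.findall(text)


sample = (
    "For the purposes of this order, the sum of $5,476,958.00 is the estimated "
    "total costs of the initial unit well covered hereby as dry hole and for the "
    "purposes of this order, the sum of $12,948,821.00 is the estimated total "
    "costs of such initial unit well as a producing well"
)

for phrase in find_amount_phrases(sample):
    print(phrase)

# The word boundary skips "prod" inside "reproduced" and stops at "dry" instead.
print(find_amount_phrases("$100 was reproduced before the dry run"))
```

If the complete word is wanted rather than just the prefix, one option is to extend the final alternative to prod\w* so that producing or production is consumed in full.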