Confusing: Python regex fails to match with a pattern that works on regex101
Question:
I am using regex to capture a string from a Word file (and many such Word files). Oddly, a seemingly good regex pattern (it works on regex101.com) does not work in Python.
In case it has something to do with the Word file, I am attaching a drive link here for your reference.
# imports
import os
import pandas as pd
import re
import docx2txt
import textract
import antiword
# setting directory
os.chdir('/Users/aartimalik/Documents/GitHub/revenue_procurement/pdfs/bidsummaries-doc-test')
text = textract.process('/Users/aartimalik/Documents/GitHub/revenue_procurement/pdfs/bidsummaries-doc/081204R0.doc_133.doc')
text = text.decode("utf-8")
nob = text.split('BID OPENING DATE')
del nob[0]
txt = nob[0]
engineers_estimate = re.search('ENGINEERS EST\s+(?:^|\s)(?=.)((?:0|(?:[1-9](?:\d*|\d{0,2}(?:,\d{3})*)))?(?:\.\d*[0-9])?)(?!\S)', txt)
if not (engineers_estimate is None):
    engineers_estimate = engineers_estimate.group(1)
else:
    engineers_estimate = 'Not captured'
amount_under_over = re.search('(AMOUNT (?:OVER|UNDER))\s+((?:\d{1,3}(?:,\d{3})*(?:\.\d\d)?))\b', txt)
if not (amount_under_over is None):
    amount_under_over1 = amount_under_over.group(2)
else:
    amount_under_over1 = 'Not captured'
The code successfully captures the engineers_estimate variable but cannot capture anything for amount_under_over.
print(amount_under_over) returns None.
According to this regex101 template, the code should capture the relevant amount under over string. Thank you so much!
Edit: Removing \b from the pattern worked! I’m not sure why it worked, though.
Answers:
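I think the problem is escape sequences, which Python processes in ordinary string literals before the pattern ever reaches the re module. You can put r before the opening quote to make it a raw string, for example:
engineers_estimate = re.search(r'ENGINEERS EST\s+(?:^|\s)(?=.)((?:0|(?:[1-9](?:\d*|\d{0,2}(?:,\d{3})*)))?(?:\.\d*[0-9])?)(?!\S)', txt)
Removing \b fixed your problem because, in an ordinary string, \b is the escape sequence for the backspace character, so the pattern was looking for a literal backspace after the amount instead of a word boundary. Sequences such as \s and \d are not string escapes, so they reach the regex engine unchanged, which is why your first pattern worked. With a raw string you can keep \b in the pattern.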
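To see the difference for yourself, here is a minimal sketch. The sample text is made up for illustration (the real text comes from your .doc file), and the names body, plain_tail and raw_tail are hypothetical; the pattern is the amount pattern from the question, split so that only the trailing \b changes.

```python
import re

# Hypothetical sample line; the real input comes from textract.
sample = "AMOUNT UNDER 1,234,567.89 PERCENT"

# Everything except the trailing \b, written as a raw string.
body = r'(AMOUNT (?:OVER|UNDER))\s+((?:\d{1,3}(?:,\d{3})*(?:\.\d\d)?))'

plain_tail = '\b'   # ordinary string: ONE character, backspace (0x08)
raw_tail = r'\b'    # raw string: TWO characters, backslash + b (word boundary)

print(len(plain_tail), len(raw_tail))        # 1 2

# With the backspace character the pattern cannot match normal text.
print(re.search(body + plain_tail, sample))  # None

# With the raw string, \b is a word boundary and the amount is captured.
m = re.search(body + raw_tail, sample)
print(m.group(2))                            # 1,234,567.89
```

Writing every regex pattern as a raw string avoids having to remember which backslash sequences Python itself consumes.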