Confusing: Python regex does not match with a pattern that works elsewhere


I am using regex to capture a string from a Word file (and many such Word files). But weirdly enough, a seemingly good regex pattern (it works on regex101) is not working in Python.

Just in case it has something to do with the word file, I am attaching a drive link here for your reference.

# imports
import os
import pandas as pd
import re
import docx2txt
import textract
import antiword

# setting directory

text = textract.process('/Users/aartimalik/Documents/GitHub/revenue_procurement/pdfs/bidsummaries-doc/081204R0.doc_133.doc')
text = text.decode("utf-8")

nob = text.split('BID OPENING DATE')
del nob[0]

txt = nob[0]

engineers_estimate = re.search('ENGINEERS EST\s+(?:^|\s)(?=.)((?:0|(?:[1-9](?:\d*|\d{0,2}(?:,\d{3})*)))?(?:\.\d*[0-9])?)(?!\S)', txt)
if not (engineers_estimate is None):
    engineers_estimate = engineers_estimate.group(1)
else:
    engineers_estimate = 'Not captured'

amount_under_over = re.search('(AMOUNT (?:OVER|UNDER))\s+((?:\d{1,3}(?:,\d{3})*(?:\.\d\d)?))\b', txt)
if not (amount_under_over is None):
    amount_under_over1 = amount_under_over.group(2)
else:
    amount_under_over1 = 'Not captured'

The code successfully captures the engineers_estimate variable but cannot capture anything for amount_under_over.

print(amount_under_over) prints None.

According to this regex101 template, the code should capture the relevant amount under over string. Thank you so much!

Edit: Removing \b from the pattern worked! I'm not sure why it worked though.

Asked By: Pepa



I think the problem is escape sequences, which Python interprets in ordinary string literals by default. You can put r before the opening quote to make the pattern a raw string, for example:
engineers_estimate = re.search(r'ENGINEERS EST\s+(?:^|\s)(?=.)((?:0|(?:[1-9](?:\d*|\d{0,2}(?:,\d{3})*)))?(?:\.\d*[0-9])?)(?!\S)', txt)

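Removing \b fixed your problem because, in a non-raw Python string, \b is the escape sequence for the backspace character. The regex engine therefore received a literal backspace to match instead of the word-boundary assertion \b.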
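To make the difference concrete, here is a minimal sketch. The sample text is hypothetical (it stands in for the text extracted from the Word file), and the pattern is split into two parts only so that the trailing \b can be written both ways. It also suggests why the rest of the pattern still worked without the r prefix: \s and \d are not escape sequences Python recognizes, so they are passed through unchanged (recent Python versions emit a warning about them), whereas \b is recognized and converted.

```python
import re

# Hypothetical sample text standing in for the extracted Word content.
sample = "AMOUNT UNDER     1,234.56\n"

# Everything in the pattern except the trailing \b, written as a raw string.
body = r'(AMOUNT (?:OVER|UNDER))\s+((?:\d{1,3}(?:,\d{3})*(?:\.\d\d)?))'

# Without the r prefix, Python turns '\b' into one backspace character.
non_raw = body + '\b'
# With the r prefix, the backslash is kept and the regex engine sees \b.
raw = body + r'\b'

print(len('\b'), len(r'\b'))        # one character vs. two characters
print(re.search(non_raw, sample))   # no backspace in the text, so no match

match = re.search(raw, sample)
print(match.group(1), match.group(2))
```

Using raw strings for every regex pattern avoids having to remember which escapes Python itself interprets.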

Answered By: Aleksa Majkic