Search pattern to include square brackets

Question:

I am trying to search for exact words in a file. I read the file by lines and loop through the lines to find the exact words. As the in keyword is not suitable for finding exact words, I am using a regex pattern.

def findWord(w):
    return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search

The problem with this function is that it doesn’t recognize square brackets, as in [xyz].

For example

findWord('data_var_cod[0]')('Cod_Byte1 = DATA_VAR_COD[0]') 

returns None whereas

findWord('data_var_cod')('Cod_Byte1 = DATA_VAR_COD') 

returns <_sre.SRE_Match object at 0x0000000015622288>

Can anybody please help me to tweak the regex pattern?

Asked By: BitsNPieces


Answers:

That’s because [ and ] have special meaning in a regex. You should escape the string you’re looking for:

re.escape(regex)

Will escape the regex for you. Change your code to:

return re.compile(r'\b({0})\b'.format(re.escape(w)), flags=re.IGNORECASE).search
                                      ↑↑↑↑↑↑↑↑↑

You can see what re.escape does for your string, for example:

>>> w = '[xyz]'
>>> print(re.escape(w))
\[xyz\]
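
To make the effect of escaping concrete, here is a minimal sketch that compares an unescaped and an escaped pattern on the line from the question. The names keyword, line, unescaped and escaped are illustrative only, and the word boundaries are left out on purpose so that only the escaping is shown.

```python
import re

keyword = 'data_var_cod[0]'
line = 'Cod_Byte1 = DATA_VAR_COD[0]'

# Unescaped: "[0]" is a character class that matches the single character "0",
# so this pattern looks for "data_var_cod0", not for literal brackets.
unescaped = re.compile(keyword, flags=re.IGNORECASE)

# Escaped: re.escape() backslash-escapes "[" and "]", so they match literally.
escaped = re.compile(re.escape(keyword), flags=re.IGNORECASE)

unescaped_on_line = unescaped.search(line)               # no match
unescaped_on_other = unescaped.search('data_var_cod0')   # matches
escaped_on_line = escaped.search(line)                   # matches
```

The unescaped pattern silently matches a different string, which is why the original function never found the bracketed keyword.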
Answered By: Maroun

This happens because the regex engine treats square brackets as a character class, which is part of regex syntax. To get rid of this problem, you need to escape the regex metacharacters in your string. You can use the re.escape function:

def findWord(w):
    return re.compile(r'\b({0})\b'.format(re.escape(w)), flags=re.IGNORECASE).search

Also, as a more Pythonic way to get all matches, you can use re.findall(), which returns a list of matches, or re.finditer(), which returns an iterator of match objects.
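
As a rough illustration of the two functions, the sketch below collects every occurrence of an escaped keyword in a made-up string. The sample text and the variable names are assumptions for the example, not part of the question.

```python
import re

# Hypothetical sample text containing the keyword twice, in different case.
text = 'DATA_VAR_COD[0] = 1; data_var_cod[0] = 2; data_var_cod[1] = 3'

pattern = re.compile(re.escape('data_var_cod[0]'), flags=re.IGNORECASE)

# findall() returns the matched strings (the pattern has no capture groups).
all_matches = pattern.findall(text)

# finditer() yields match objects, which also carry positions.
spans = [m.span() for m in pattern.finditer(text)]
```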

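However, this approach is still incomplete, because a word boundary \b only matches between a word character and a non-word character. If your keyword starts or ends with a non-word character such as [ and is surrounded by spaces, the boundary never matches: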

>>> ss = 'hello string [processing] in python.'
>>> re.compile(r'\b({0})\b'.format(re.escape('[processing]')), flags=re.IGNORECASE).search(ss)
>>>
>>> re.compile(r'({})'.format(re.escape('[processing]')), flags=re.IGNORECASE).search(ss).group(0)
'[processing]'

So I suggest removing the word boundaries if your keywords contain non-word characters.
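
The behaviour of \b next to a bracket can be checked with a small sketch like the one below. The second input string, 'x[processing]y', is a made-up example chosen so that word characters touch both brackets.

```python
import re

escaped = re.escape('[processing]')
bounded = re.compile(r'\b{0}\b'.format(escaped), flags=re.IGNORECASE)

# Space before "[" and space after "]": non-word next to non-word,
# so \b cannot match at either end.
spaced = bounded.search('hello string [processing] in python.')

# Word characters touch both brackets, so \b holds at each end.
glued = bounded.search('x[processing]y')
```

Here spaced is None while glued is a match, which is the opposite of what an "exact word" search would want.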

As a more general approach, you can use the following regex. It matches a keyword that is preceded by a space or the start of the string, and uses a positive lookahead to require that it is followed by a period, a space, or the end of the string:

r'(?: |^)({})(?=[. ]|$)'
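
A possible way to plug this pattern into the question's function is sketched below. It assumes the pattern is used without a trailing space after the lookahead, and that the keyword is passed through re.escape(); the function name find_word_spaced is hypothetical.

```python
import re

def find_word_spaced(w):
    # Keyword must follow a space or the start of the string, and be followed
    # by ".", a space, or the end of the string (checked, not consumed).
    pattern = r'(?: |^)({})(?=[. ]|$)'.format(re.escape(w))
    return re.compile(pattern, flags=re.IGNORECASE).search

in_sentence = find_word_spaced('[processing]')('hello string [processing] in python.')
at_end = find_word_spaced('data_var_cod[0]')('Cod_Byte1 = DATA_VAR_COD[0]')
inside_word = find_word_spaced('data_var_cod[0]')('xdata_var_cod[0]')
```

The keyword itself is in group(1), because the leading space, when present, is part of the overall match.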
Answered By: Mazdak

You need a “smart” way of building the regex:

def findWord(w):
    # re.escape(w) makes regex metacharacters such as [ and ] match literally.
    if re.match(r'\w', w) and re.search(r'\w$', w):
        return re.compile(r'\b{0}\b'.format(re.escape(w)), flags=re.IGNORECASE).search
    if not re.match(r'\w', w) and not re.search(r'\w$', w):
        return re.compile(r'{0}'.format(re.escape(w)), flags=re.IGNORECASE).search
    if not re.match(r'\w', w) and re.search(r'\w$', w):
        return re.compile(r'{0}\b'.format(re.escape(w)), flags=re.IGNORECASE).search
    if re.match(r'\w', w) and not re.search(r'\w$', w):
        return re.compile(r'\b{0}'.format(re.escape(w)), flags=re.IGNORECASE).search

The problem is that some of your keywords will have word characters at the start only, others at the end only, most will have word characters on both ends, and some will have non-word characters on both ends. To check the word boundary effectively, you need to know whether a word character is present at the start and at the end of the keyword.

Thus, with re.match(r'\w', w) we can check whether the keyword starts with a word character and, if so, add \b to the start of the pattern; with re.search(r'\w$', w) we can check whether the keyword ends with a word character and, if so, add \b to the end.
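
The same idea can be written more compactly by building the prefix and suffix separately. This is only a sketch of the approach described above, not the answerer's code; the name find_word_adaptive is hypothetical, and it assumes keywords are literal strings, so they are passed through re.escape().

```python
import re

def find_word_adaptive(w):
    # Add \b only on the side(s) where the keyword has a word character.
    prefix = r'\b' if re.match(r'\w', w) else ''
    suffix = r'\b' if re.search(r'\w$', w) else ''
    return re.compile(prefix + re.escape(w) + suffix, flags=re.IGNORECASE).search

bracketed = find_word_adaptive('data_var_cod[0]')('Cod_Byte1 = DATA_VAR_COD[0]')
plain_partial = find_word_adaptive('data_var_cod')('Cod_Byte1 = DATA_VAR_CODE')
glued_prefix = find_word_adaptive('data_var_cod[0]')('XDATA_VAR_COD[0]')
all_symbols = find_word_adaptive('[processing]')('hello string [processing] in python.')
```

With this variant the bracketed keyword from the question is found, while a longer identifier that merely contains the keyword is still rejected on the side that has a word character.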

In case you have multiple keywords to check a string against you can check this post of mine.

Answered By: Wiktor Stribiżew

You can use a backslash (\) before [ or ].

For instance, to find 'abc[12]' in 'xyzabc[12]def', one can use

match_pattern = r'abc\[12\]'
Answered By: Pushpanshu