Search pattern to include square brackets
Question:
I am trying to search for exact words in a file. I read the file by lines and loop through the lines to find the exact words. As the in
keyword is not suitable for finding exact words, I am using a regex pattern.
def findWord(w):
    return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search
The problem with this function is that it doesn’t recognize square brackets, as in [xyz].
For example
findWord('data_var_cod[0]')('Cod_Byte1 = DATA_VAR_COD[0]')
returns None
whereas
findWord('data_var_cod')('Cod_Byte1 = DATA_VAR_COD')
returns <_sre.SRE_Match object at 0x0000000015622288>
Can anybody please help me to tweak the regex pattern?
Answers:
That’s because [ and ] have special meaning in a regular expression. You should escape the string you’re looking for:
re.escape(regex)
Will escape the regex for you. Change your code to:
return re.compile(r'\b({0})\b'.format(re.escape(w)), flags=re.IGNORECASE).search
                                      ↑↑↑↑↑↑↑↑↑
You can see what re.escape does for your string, for example:
>>> w = '[xyz]'
>>> print(re.escape(w))
\[xyz\]
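To make the effect concrete, here is a small sketch (the sample strings are made up for illustration) showing that unescaped brackets are read as a character class, while the escaped version matches the literal text:

```python
import re

keyword = '[xyz]'

# Unescaped: [xyz] is a character class meaning "one of x, y or z".
unescaped = re.compile(keyword)
# Escaped: the brackets are matched literally.
escaped = re.compile(re.escape(keyword))

print(re.escape(keyword))                   # \[xyz\]
print(unescaped.search('a[xyz]b').group())  # x
print(escaped.search('a[xyz]b').group())    # [xyz]
print(escaped.search('xyz'))                # None
```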
That’s because the regex engine treats the square brackets as a character class, which is part of the regex syntax. To get rid of this problem you need to escape the regex metacharacters in your keyword. You can use the re.escape function:
def findWord(w):
    return re.compile(r'\b({0})\b'.format(re.escape(w)), flags=re.IGNORECASE).search
Also, as a more Pythonic way to get all matches, you can use re.findall(), which returns a list of matches, or re.finditer(), which returns an iterator of match objects.
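As a rough illustration of the difference between the two calls, here is a minimal sketch; the sample line and variable names are made up for the example:

```python
import re

# Compile once, then reuse the pattern for every line of the file.
pattern = re.compile(r'\b{0}\b'.format(re.escape('data_var_cod')),
                     flags=re.IGNORECASE)

line = 'x = DATA_VAR_COD + data_var_cod'

# findall returns the matched strings (the pattern has no capture group).
all_matches = pattern.findall(line)

# finditer yields match objects, so positions are available too.
spans = [m.span() for m in pattern.finditer(line)]

print(all_matches)  # ['DATA_VAR_COD', 'data_var_cod']
print(spans)        # [(4, 16), (19, 31)]
```

Note that if the pattern contains a capturing group, findall returns the group contents instead of the whole match.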
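However, escaping alone is still not a complete solution, because a word boundary \b only matches between a word character and a non-word character. If your keyword starts or ends with a non-word character such as a bracket, the boundary next to it may not match: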
>>> ss = 'hello string [processing] in python.'
>>> re.compile(r'\b({0})\b'.format(re.escape('[processing]')), flags=re.IGNORECASE).search(ss)
>>>
>>> re.compile(r'({})'.format(re.escape('[processing]')), flags=re.IGNORECASE).search(ss).group(0)
'[processing]'
So I suggest removing the word boundaries if your keywords contain non-word characters.
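As far as I can tell, the same limitation affects the example from the question: the keyword ends in ], and \b cannot match between ] and the end of the line, so escaping alone is not enough there. A minimal sketch (both function names are hypothetical, chosen only for this comparison):

```python
import re

def find_word_escaped(w):
    # Hypothetical name: the question's findWord with re.escape added.
    return re.compile(r'\b({0})\b'.format(re.escape(w)),
                      flags=re.IGNORECASE).search

def find_word_no_boundary(w):
    # Hypothetical name: same pattern, but without the \b anchors.
    return re.compile(r'({0})'.format(re.escape(w)),
                      flags=re.IGNORECASE).search

line = 'Cod_Byte1 = DATA_VAR_COD[0]'

print(find_word_escaped('data_var_cod[0]')(line))               # None
print(find_word_no_boundary('data_var_cod[0]')(line).group(0))  # DATA_VAR_COD[0]
```

The trade-off is that without any boundary the search also matches inside longer names, e.g. 'var_cod[0]' is found in the same line.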
As a more general approach, you can use the following regex, which uses a positive lookahead and matches keywords that are surrounded by spaces or that come at the start or end of the string:
r'(?: |^)({})(?=[. ]|$)'
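Here is a sketch of how that pattern might be plugged into the search function. The helper name is hypothetical, and I am assuming the pattern is meant to be used without a trailing space and with the keyword passed through re.escape:

```python
import re

def find_word_spaced(w):
    # Hypothetical helper: the keyword must be preceded by a space or the
    # start of the string, and followed by a space, a dot, or the end.
    pattern = r'(?: |^)({})(?=[. ]|$)'.format(re.escape(w))
    return re.compile(pattern, flags=re.IGNORECASE).search

ss = 'hello string [processing] in python.'
line = 'Cod_Byte1 = DATA_VAR_COD[0]'

# The keyword itself is in group 1; group 0 may include the leading space.
print(find_word_spaced('[processing]')(ss).group(1))       # [processing]
print(find_word_spaced('python')(ss).group(1))             # python
print(find_word_spaced('data_var_cod[0]')(line).group(1))  # DATA_VAR_COD[0]
print(find_word_spaced('var_cod[0]')(line))                # None
```

Only spaces and dots count as separators here, so a keyword followed by a comma or a closing parenthesis would not be found unless the character class is extended.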
You need a “smart” way of building the regex:
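def findWord(w):
    e = re.escape(w)  # escape regex metacharacters such as [ and ]
    # Word characters at both ends: boundary on both sides.
    if re.match(r'\w', w) and re.search(r'\w$', w):
        return re.compile(r'\b{0}\b'.format(e), flags=re.IGNORECASE).search
    # Non-word characters at both ends: no boundaries.
    if not re.match(r'\w', w) and not re.search(r'\w$', w):
        return re.compile(r'{0}'.format(e), flags=re.IGNORECASE).search
    # Word character at the end only: trailing boundary.
    if not re.match(r'\w', w) and re.search(r'\w$', w):
        return re.compile(r'{0}\b'.format(e), flags=re.IGNORECASE).search
    # Word character at the start only: leading boundary.
    if re.match(r'\w', w) and not re.search(r'\w$', w):
        return re.compile(r'\b{0}'.format(e), flags=re.IGNORECASE).search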
The problem is that some of your keywords will have word characters at the start only, others – at the end only, most will have word characters on both ends, and some will have non-word characters. To effectively check the word boundary, you need to know if a word character is present at the start/end of the keyword.
Thus, with re.match(r'\w', w) we can check if the keyword starts with a word character and, if it does, add \b to the start of the pattern. Likewise, with re.search(r'\w$', w) we can check if the keyword ends with a word character and, if it does, add \b to the end.
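The four cases can also be folded into two independent checks. This is only a sketch of the same idea, with a hypothetical function name, and it assumes the keyword should be escaped with re.escape:

```python
import re

def find_word(w):
    # Hypothetical compact variant: add \b only on the sides where the
    # keyword itself starts or ends with a word character.
    prefix = r'\b' if re.match(r'\w', w) else ''
    suffix = r'\b' if re.search(r'\w$', w) else ''
    pattern = prefix + re.escape(w) + suffix
    return re.compile(pattern, flags=re.IGNORECASE).search

line = 'Cod_Byte1 = DATA_VAR_COD[0]'

print(find_word('data_var_cod[0]')(line).group(0))  # DATA_VAR_COD[0]
print(find_word('data_var_cod')(line).group(0))     # DATA_VAR_COD
print(find_word('cod')(line))                       # None
```

The last call returns None because every "cod" in the line is part of a longer word, so the \b checks fail.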
In case you have multiple keywords to check a string against you can check this post of mine.
You can use a \ before [ or ].
For instance, to find 'abc[12]' in 'xyzabc[12]def', one can use
match_pattern = r'abc\[12\]'
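A quick check of the hand-escaped pattern, as a sketch; the raw-string prefix is my addition so that Python leaves the backslashes for the regex engine:

```python
import re

match_pattern = r'abc\[12\]'  # raw string, brackets escaped by hand
text = 'xyzabc[12]def'

print(re.search(match_pattern, text).group(0))  # abc[12]

# Without the backslashes, [12] means "the character 1 or 2".
print(re.search('abc[12]', text))               # None
print(re.search('abc[12]', 'abc1').group(0))    # abc1
```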