re.findall not returning full match?
Question:
I have a file that includes a bunch of strings like "size=XXX;". I am trying Python’s re module for the first time and am a bit mystified by the following behavior: if I use a pipe for ‘or’ in a regular expression, I only see that bit of the match returned. E.g.:
>>> myfile = open('testfile.txt', 'r').read()
>>> re.findall('size=50;', myfile)
['size=50;', 'size=50;', 'size=50;', 'size=50;']
>>> re.findall('size=51;', myfile)
['size=51;', 'size=51;', 'size=51;']
>>> re.findall('size=(50|51);', myfile)
['51', '51', '51', '50', '50', '50', '50']
>>> re.findall(r'size=(50|51);', myfile)
['51', '51', '51', '50', '50', '50', '50']
The "size=" part of the match is gone (yet it is certainly used in the search, otherwise there would be more results). What am I doing wrong?
Answers:
'size=(50|51);' means you are looking for size=50; or size=51;, but the parentheses capture only the 50 or 51 part, so findall returns just that part and not the size= prefix.
If you want the size= returned, you can do:
re.findall('(size=50|size=51);',myfile)
Note that the trailing ; sits outside the parentheses here, so it is not included in the returned strings.
The problem you have is that if the regex that re.findall tries to match contains capturing groups (i.e. the portions of the regex that are enclosed in parentheses), then it is the groups that are returned, rather than the whole matched string.
One way to solve this issue is to use non-capturing groups (prefixed with ?:).
>>> import re
>>> s = 'size=50;size=51;'
>>> re.findall('size=(?:50|51);', s)
['size=50;', 'size=51;']
If the regex that re.findall tries to match does not capture anything, it returns the whole of the matched string.
Although using character classes might be the simplest option in this particular case, non-capturing groups provide a more general solution.
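To make the rule above concrete, here is a minimal sketch of what re.findall returns for zero, one, and two capturing groups. The sample string s and the variable names are made up for illustration; they are not from the original question.

```python
import re

s = 'size=50;size=51;size=52;'

# No groups: each item is the whole match.
no_groups = re.findall(r'size=5[01];', s)

# One capturing group: each item is only that group's text.
one_group = re.findall(r'size=(50|51);', s)

# Two capturing groups: each item is a tuple with one entry per group.
two_groups = re.findall(r'(size)=(50|51);', s)

# Non-capturing group: alternation without changing what findall returns.
non_capturing = re.findall(r'size=(?:50|51);', s)

print(no_groups)      # ['size=50;', 'size=51;']
print(one_group)      # ['50', '51']
print(two_groups)     # [('size', '50'), ('size', '51')]
print(non_capturing)  # ['size=50;', 'size=51;']
```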
I think what you want is to use [] instead of (). [] indicates a set of characters while () indicates a group match. Try something like this:
re.findall('size=5[01];', myfile)
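A character class only picks one character at a time, so it fits 50 vs. 51 but not alternatives of different lengths. The sketch below illustrates the difference on a made-up string; the value 100 is only an assumption chosen for the example.

```python
import re

s = 'size=50;size=51;size=100;'

# A character class matches exactly one character from the set.
single_char = re.findall(r'size=5[01];', s)

# Alternatives longer than one character need alternation,
# here inside a non-capturing group so findall returns whole matches.
multi_char = re.findall(r'size=(?:50|100);', s)

print(single_char)  # ['size=50;', 'size=51;']
print(multi_char)   # ['size=50;', 'size=100;']
```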
When a regular expression contains parentheses, they capture their contents to groups, changing the behaviour of findall() to only return those groups. Here’s the relevant section from the docs:
(...)
Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(], [)].
To avoid this behaviour, you can use a non-capturing group:
>>> re.findall(r'size=(?:50|51);',myfile)
['size=51;', 'size=51;', 'size=51;', 'size=50;', 'size=50;', 'size=50;', 'size=50;']
Again, from the docs:
(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
In some cases a non-capturing group is not appropriate, for example with a regex that detects repeated words (example from the Python docs):
r'(\b\w+)\s+\1'
In this situation, to get the whole match one can use
[groups[0] for groups in re.findall(r'((\b\w+)\s+\2)', text)]
Note that \1 has changed to \2, because the added outer parentheses are now group 1.
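Here is a small runnable sketch of the repeated-word case, assuming a made-up sample text. It shows the plain pattern returning only the captured word, the outer-group workaround, and, as an alternative, re.finditer, which leaves the group numbering unchanged.

```python
import re

text = 'Paris in the the spring, and and more'  # illustrative sample only

# The capturing group is required by the backreference, so findall
# returns just the repeated word, not the whole match.
words = re.findall(r'(\b\w+)\s+\1', text)

# Wrap the whole pattern in an outer group; the inner group becomes
# group 2, so the backreference changes from \1 to \2.
whole = [groups[0] for groups in re.findall(r'((\b\w+)\s+\2)', text)]

# Alternative: keep the original pattern and read the full match
# from each Match object.
whole_iter = [m.group(0) for m in re.finditer(r'(\b\w+)\s+\1', text)]

print(words)       # ['the', 'and']
print(whole)       # ['the the', 'and and']
print(whole_iter)  # ['the the', 'and and']
```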
Here is a clean solution: https://www.ocpsoft.org/tutorials/regular-expressions/or-in-regex/
In case the website dies, here is the example (try it on regex101.com):
regex:
^I like (dogs|penguins), but not (lions|tigers).$
try with:
I like dogs, but not lions.
I like dogs, but not tigers.
I like penguins, but not lions.
I like penguins, but not tigers.
Match 1
Full match 2-29 I like dogs, but not lions.
Group 1. 9-13 dogs
Group 2. 23-28 lions
…
but with regex:
^I like (?:dogs|penguins), but not (?:lions|tigers).$
Match 1
Full match 2-29 I like dogs, but not lions.
Match 2
Full match 30-58 I like dogs, but not tigers.
…
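The same comparison can be reproduced in Python. This is a sketch under a few assumptions: the four test sentences are joined with newlines, re.MULTILINE is used so ^ and $ match at each line, and the final dot is escaped as \. to match a literal period.

```python
import re

lines = (
    "I like dogs, but not lions.\n"
    "I like dogs, but not tigers.\n"
    "I like penguins, but not lions.\n"
    "I like penguins, but not tigers."
)

# Capturing groups: findall returns one tuple of groups per match.
captured = re.findall(
    r'^I like (dogs|penguins), but not (lions|tigers)\.$',
    lines, flags=re.MULTILINE)

# Non-capturing groups: findall returns each full matched line.
full = re.findall(
    r'^I like (?:dogs|penguins), but not (?:lions|tigers)\.$',
    lines, flags=re.MULTILINE)

print(captured[0])  # ('dogs', 'lions')
print(full[0])      # I like dogs, but not lions.
```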
As others mentioned, the "problem" with re.findall is that it returns a list of strings or of tuples of strings, depending on the use of capture groups. If you don’t want to change the capture groups you’re using (that is, not switch to character classes [] or non-capturing groups (?:)), you can use finditer instead of findall. This gives an iterator of Match objects, instead of just strings. So now you can fetch the full match, even when using capture groups:
import re
s = 'size=50;size=51;'
for m in re.finditer('size=(50|51);', s):
    print(m.group())
This will give:
size=50;
size=51;
And if you need a list, similar to findall, you can use a list comprehension:
>>> [m.group() for m in re.finditer('size=(50|51);', s)]
['size=50;', 'size=51;']
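Because each item is a Match object, the full match, the captured group, and the position are all available at once. A minimal sketch, reusing the sample string from above (the name results is just for illustration):

```python
import re

s = 'size=50;size=51;'

# group(0) is the whole match, group(1) is the first capturing group,
# and span() gives the (start, end) indices of the whole match in s.
results = [(m.group(0), m.group(1), m.span())
           for m in re.finditer(r'size=(50|51);', s)]

print(results)
# [('size=50;', '50', (0, 8)), ('size=51;', '51', (8, 16))]
```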