Why does grouping in a regex cause a partial match?

Question:

I was trying a simple regex search to check the validity of an IPv6 address. I started with a simplified case: matching four colon-separated blocks of four hex letters each.

For example:

The string – acbe:abfe:aaee:afec

I first used the following regex, which works fine:

Python 2.7.3 (default, Sep 26 2013, 20:03:06) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> r = re.compile("[a-f]{4}:[a-f]{4}:[a-f]{4}:[a-f]{4}")
>>> s = "acbe:abfe:aaee:afec"
>>> r.findall(s)
['acbe:abfe:aaee:afec']

Then, since the pattern repeats, I tried a shorter regex that uses a group:

>>> r = re.compile("([a-f]{4}:){3}[a-f]{4}")
>>> r.findall(s)
['aaee:']

Note that only part of the address is returned. When tested on the regex testing website RegexPal, the same pattern matches the full address.

Why isn’t the whole address returned? Doesn’t Python support grouping in complex regexes?

Asked By: Kartik Anand

||

Answers:

You need to change your compile line to:

r = re.compile("(?:[a-f]{4}:){3}[a-f]{4}")

When your regex contains capturing groups, findall returns the groups instead of the entire match. Here the group is repeated 3 times within a single match, and a repeated group keeps only the text of its last repetition, so the 3rd piece ('aaee:') is what gets returned.

Adding ?: right after the opening parenthesis makes it a non-capturing group. This lets you group the sub-pattern for repetition without findall capturing it. Since there are now no capturing groups, findall returns the entire match.

Edit: It appears to work here in Python 2.6:

>>> r = re.compile("(?:[a-f]{4}:){3}[a-f]{4}")
>>> s = "acbe:abfe:aaee:afec"
>>> r.findall(s)
['acbe:abfe:aaee:afec']
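
To see that the original pattern does match the whole address and that only the return value of findall differs, here is a minimal sketch using the question's string. The variable names (capturing, non_capturing, whole_match, last_repeat) are illustrative and not from the original post:

```python
import re

s = "acbe:abfe:aaee:afec"

# Same pattern twice: once with a capturing group, once with a non-capturing one.
capturing = re.compile(r"([a-f]{4}:){3}[a-f]{4}")
non_capturing = re.compile(r"(?:[a-f]{4}:){3}[a-f]{4}")

m = capturing.search(s)
whole_match = m.group(0)  # group 0 is always the entire match
last_repeat = m.group(1)  # a repeated group keeps only its final repetition

print(whole_match)               # acbe:abfe:aaee:afec
print(last_repeat)               # aaee:
print(capturing.findall(s))      # ['aaee:']
print(non_capturing.findall(s))  # ['acbe:abfe:aaee:afec']
```

Both patterns match the same text; findall simply reports group 1 when a capturing group exists.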
Answered By: Corley Brigman

I’m assuming you’re trying to get each four-letter string? You want the findall to return ['acbe','abfe','aaee','afec']?

>>> r = re.compile(r"[a-f]{4}(?=:)|(?<=:)[a-f]{4}")
>>> s = "acbe:abfe:aaee:afec"
>>> r.findall(s)
['acbe', 'abfe', 'aaee', 'afec']
Answered By: Adam Smith

In "[a-f]{4}:[a-f]{4}:[a-f]{4}:[a-f]{4}" there is no group defined, so re.findall() returns group 0 of every match it finds, that is to say, the entire matches.

In "([a-f]{4}:){3}[a-f]{4}", there is one group defined, and re.findall() returns the portion of each match that corresponds to this group. But as this group is repeated, only its last occurrence in each overall match is returned.

Putting ?: just after the opening paren of the group makes it a non-capturing group, so re.findall() again returns the entire matches.
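
If the capturing group has to stay in the pattern for some reason, group 0 can still be retrieved in other ways. The following is a rough sketch of two options, assuming the same test string as in the question; the names full_matches, wrapped and wrapped_matches are made up for illustration:

```python
import re

s = "acbe:abfe:aaee:afec"
pattern = re.compile(r"([a-f]{4}:){3}[a-f]{4}")

# Option 1: finditer yields match objects, so group(0) (the entire match)
# is available no matter how many groups the pattern defines.
full_matches = [m.group(0) for m in pattern.finditer(s)]

# Option 2: wrap the whole pattern in one outer capturing group and make the
# inner group non-capturing; findall then returns that single outer group.
wrapped = re.compile(r"((?:[a-f]{4}:){3}[a-f]{4})")
wrapped_matches = wrapped.findall(s)

print(full_matches)     # ['acbe:abfe:aaee:afec']
print(wrapped_matches)  # ['acbe:abfe:aaee:afec']
```

Note that with more than one capturing group, findall returns a list of tuples rather than a list of strings, which is why the inner group in option 2 is kept non-capturing.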

Answered By: eyquem