How do I regex match with grouping with unknown number of groups

Question:

I want to do a regex match (in Python) on the output log of a program. The log contains some lines that look like this:

... 
VALUE 100 234 568 9233 119
... 
VALUE 101 124 9223 4329 1559
...

I would like to capture the list of numbers that follows VALUE on the first line that starts with VALUE, i.e., I want it to return ('100','234','568','9233','119'). The problem is that I do not know in advance how many numbers there will be.

I tried to use this as a regex:

VALUE (?:(\d+)\s)+

This matches the line, but it only captures the last value, so I just get ('119',).

Asked By: Lorin Hochstein

||

Answers:

What you’re looking for is a parser, instead of a regular expression match. In your case, I would consider using a very simple parser, split():

s = "VALUE 100 234 568 9233 119"
a = s.split()  # split on runs of whitespace
if a[0] == "VALUE":
    print([int(x) for x in a[1:]])  # convert every token after the keyword to int

You can use a regular expression to check whether your input line matches your expected format (using the regex in your question). Then you can run the above code without having to check for "VALUE", knowing that the int(x) conversion will always succeed, since you’ve already confirmed that the following character groups are all digits.
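
A minimal sketch of that validate-then-split idea, assuming Python 3. The helper name parse_value_line and the exact validation pattern are illustrative choices, not part of the original answer:

```python
import re

# Assumed format: the keyword VALUE followed by one or more
# whitespace-separated runs of digits, and nothing else on the line.
VALUE_LINE = re.compile(r"VALUE(?:\s+\d+)+\s*")


def parse_value_line(line):
    """Return the numbers on a VALUE line as ints, or None if the line does not match."""
    if VALUE_LINE.fullmatch(line) is None:
        return None
    # The format is already validated, so every token after the keyword is digits.
    return [int(token) for token in line.split()[1:]]


print(parse_value_line("VALUE 100 234 568 9233 119"))
```

Because the regex only validates and split() does the extraction, the number of values on the line does not need to be known in advance.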

Answered By: Greg Hewgill

You could just run your main match regex, then run a secondary regex on those matches to get the numbers:

matches = Regex.Match(log)

foreach (Match match in matches)
{
    submatches = Regex2.Match(match)
}

This is, of course, only if you don’t want to write a full parser.
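
In Python, the same two-pass idea might look like the sketch below. The sample log text and both patterns are assumptions made for illustration:

```python
import re

# Hypothetical log contents, modelled on the lines in the question.
log = """\
header line
VALUE 100 234 568 9233 119
noise
VALUE 101 124 9223 4329 1559
"""

# First pass: locate whole VALUE lines (re.MULTILINE lets ^ and $ match per line).
line_pattern = re.compile(r"^VALUE(?: \d+)+$", re.MULTILINE)
# Second pass: pull the individual numbers out of one matched line.
number_pattern = re.compile(r"\d+")

rows = [number_pattern.findall(m.group()) for m in line_pattern.finditer(log)]
first = tuple(rows[0]) if rows else ()
print(first)
```

Here rows holds one list of strings per VALUE line, and first is the tuple for the first such line.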

Answered By: Chris J

>>> import re
>>> reg = re.compile(r'\d+')
>>> reg.findall('VALUE 100 234 568 9233 119')
['100', '234', '568', '9233', '119']

That doesn’t validate that the keyword ‘VALUE’ appears at the beginning of the string, and it doesn’t validate that there is exactly one space between items, but if you can do that as a separate step (or if you don’t need to do that at all), then it will find all digit sequences in any string.
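
A minimal sketch of doing that check as a separate step, assuming a simple prefix test on "VALUE " is enough validation for your log. The helper name numbers_after_value is hypothetical:

```python
import re

digits = re.compile(r"\d+")


def numbers_after_value(line):
    """Return the digit runs in a line that starts with 'VALUE ', else an empty list."""
    if not line.startswith("VALUE "):
        return []
    return digits.findall(line)


print(numbers_after_value("VALUE 100 234 568 9233 119"))
# Without the prefix check, findall picks up digits from any line at all.
print(digits.findall("ERROR 404 at line 12"))
```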

Answered By: Ian Clelland

Another option not described here is to have a bunch of optional capturing groups.

VALUE *(\d+)? *(\d+)? *(\d+)? *(\d+)? *(\d+)? *$

This regex captures up to 5 digit groups separated by spaces. If you need more potential groups, just copy and paste more *(\d+)? blocks.
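
As a rough illustration in Python (the three-number input line is made up for the example), the unused optional groups come back as None and can be filtered out:

```python
import re

# Up to five optional number groups, as in the pattern above.
pattern = re.compile(r"VALUE *(\d+)? *(\d+)? *(\d+)? *(\d+)? *(\d+)? *$")

match = pattern.match("VALUE 100 234 568")
# Groups that did not participate in the match are None.
numbers = [g for g in match.groups() if g is not None]
print(numbers)

# A line with more numbers than groups does not match at all.
too_many = pattern.match("VALUE 1 2 3 4 5 6")
print(too_many)
```

The fixed upper bound is the trade-off of this approach: a line with more numbers than groups is rejected rather than truncated.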

Answered By: Scottmas

I had this same problem and my solution was to use two regular expressions: the first one to match the whole group I’m interested in and the second one to parse the sub groups. For example in this case, I’d start with this:

VALUE((\s\d+)+)

This should result in three groups: [0] the whole match, [1] everything after VALUE, and [2] the last space+value pair.

[0] and [2] can be ignored and then [1] can be used with the following:

\s(\d+)

Note: these regexps were not tested, I hope you get the idea though.
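
For what it is worth, here is one way the two patterns could be run in Python. The variable names are illustrative, and the input is the sample line from the question:

```python
import re

outer = re.compile(r"VALUE((\s\d+)+)")  # first regex: grab everything after VALUE
inner = re.compile(r"\s(\d+)")          # second regex: parse the sub groups

m = outer.match("VALUE 100 234 568 9233 119")
tail = m.group(1)  # the whole run of space+number pairs
last = m.group(2)  # a repeated group keeps only its final repetition
numbers = inner.findall(tail)
print(numbers)
```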


The reason why Greg’s answer doesn’t work for me is because the 2nd part of the parsing is more complicated and not simply some numbers separated by a space.

However, I would honestly go with Greg’s solution for this question (it’s probably way more efficient).

I’m just writing this answer in case someone is looking for a more sophisticated solution like I needed.

Answered By: Christian

You can first validate the line with re.match, then call re.split with a regex as the separator.

>>> import re
>>> s = "VALUE 100 234 568 9233 119"
>>> sep = r"\s+"
>>> reg = re.compile(r"VALUE(%s\d+)+" % sep)  # OR r"VALUE(\s+\d+)+"
>>> reg_sep = re.compile(sep)
>>> if reg.match(s):  # OR re.match(r"VALUE(\s+\d+)+", s)
...     result = reg_sep.split(s)[1:]  # OR re.split(r"\s+", s)[1:]
...
>>> result
['100', '234', '568', '9233', '119']

The separator r"\s+" can be replaced with a more complicated pattern.
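
For instance, here is a sketch with a hypothetical separator that accepts either a comma (with optional surrounding whitespace) or plain whitespace. The separator and the input line are assumptions for illustration:

```python
import re

# Hypothetical separator: a comma with optional whitespace around it, or plain whitespace.
# The group is non-capturing so that re.split does not return the separators themselves.
sep = r"(?:\s*,\s*|\s+)"
line_pattern = re.compile(r"VALUE(?:%s\d+)+" % sep)
sep_pattern = re.compile(sep)

s = "VALUE 100, 234 ,568 9233,119"
# Validate the whole line first, then split on the separator and drop the keyword.
result = sep_pattern.split(s)[1:] if line_pattern.fullmatch(s) else []
print(result)
```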

Answered By: H. Chan