regex: Find all groups of consecutive groups, where the groups are separated by pattern

Question

I have a badly parsed text in which multiple text blocks are separated by lines containing only three digits. I want a regex that captures all the text in a block, starting with (and including) the three-digit line and running up to the last whitespace before the next three-digit line.

This is the one I’ve tried, but because it ends with a lookahead that requires another three-digit line, the last group is not captured.
\n*((\d{3})\n*([\S\s]+?)(?=\s\d{3}\s))

Sample:

foo
000

foo bar
foo

461

long
multiline
text

999

last example
until rest of document

Expected groups:

[000

foo bar
foo
] Group 1
[461

long
multiline
text
] Group 2
[999

last example
until rest of document] Group 3

Asked By: Jano


Answer 1

Does this solve your problem? You need to add "$" as an alternative inside the lookahead so that the last group is matched too. "$" matches the end of the text.

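import re

# `text` is assumed to hold the sample from the question.
# Lazily match from three digits up to the next three digits or the end of the text.
pattern = r'(\d{3}(.|\n|\r)*?)(?=\d{3}|$)'

for match in re.finditer(pattern, text):
    print(match.group())
    print('=' * 50)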

Output:

000

foo bar
foo


==================================================
461

long
multiline
text


==================================================
999

last example
until rest of document
==================================================
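
One caveat worth checking: the "\d{3}" in this pattern is not tied to a whole line, so three digits appearing inside a block's body would also end that block. Below is a minimal sketch of a stricter variant, which is not part of the original answer. It assumes the separator is always three digits alone on a line and uses re.MULTILINE and re.DOTALL; the sample text with "1234" in it is hypothetical and only there to show the difference.

```python
import re

# Hypothetical sample: the second block mentions "1234" inside its body.
text = "foo\n000\n\nfoo bar\n\n461\n\nroom 1234 is free\n\n999\n\nlast example"

# The answer's pattern: any run of three digits ends a block.
loose = re.compile(r"(\d{3}(.|\n|\r)*?)(?=\d{3}|$)")

# Stricter variant (an assumption, not the answer's code): the separator must
# be three digits alone on a line. MULTILINE lets ^ and $ match at line
# boundaries, DOTALL lets . cross newlines, \Z is the very end of the text.
strict = re.compile(r"^\d{3}$.*?(?=^\d{3}$|\Z)", re.MULTILINE | re.DOTALL)

loose_blocks = [m.group() for m in loose.finditer(text)]
strict_blocks = [m.group() for m in strict.finditer(text)]

print(len(loose_blocks), "blocks with the loose pattern")
print(len(strict_blocks), "blocks with the strict pattern")
for block in strict_blocks:
    print(repr(block))
```

If the body text is known never to contain three consecutive digits, the original pattern is enough and the extra flags are unnecessary.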

Answered By: Ted Nguyen

Answer 2

You can use a negative lookahead to match all lines that do not contain the starting group token. This way the end of file is not a problem.

(^\d{3}$(?:(?!^\d{3}$)[\s\S])+)

Analysing it (the pattern needs the multiline flag, so that ^ and $ match at every line boundary):

(^\d{3}$(?:(?!^\d{3}$)[\s\S])+)
Our only group. Every match will contain one.

^\d{3}$
The token that marks the start of a group: three digits alone on a line.

(?:(?!^\d{3}$)[\s\S])+
The rest of the group. Match all consecutive characters that satisfy the rule, without capturing them one by one (?:xxx).

(?!^\d{3}$)[\s\S]
Match any character, line breaks included ([\s\S]), as long as the group start token does not begin at that position.
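
For readers working in Python, here is a minimal sketch of how this pattern could be applied. It assumes the backslashes are written out (\d, \s, \S) and that re.MULTILINE is enabled; the variable names are illustrative, and the sample text is the one from the question.

```python
import re

# Sample text from the question, reproduced so the snippet runs on its own.
text = (
    "foo\n"
    "000\n"
    "\n"
    "foo bar\n"
    "foo\n"
    "\n"
    "461\n"
    "\n"
    "long\n"
    "multiline\n"
    "text\n"
    "\n"
    "999\n"
    "\n"
    "last example\n"
    "until rest of document"
)

# re.MULTILINE makes ^ and $ match at every line boundary, which the
# "three digits alone on a line" token relies on.
pattern = re.compile(r"(^\d{3}$(?:(?!^\d{3}$)[\s\S])+)", re.MULTILINE)

# Each match is one block: the three-digit line plus everything up to
# (but not including) the next three-digit line, or the end of the text.
blocks = [m.group(1) for m in pattern.finditer(text)]
for number, block in enumerate(blocks, start=1):
    print(f"Group {number}: {block!r}")
```

Because the negative lookahead is tested before every consumed character, no closing lookahead is needed, and the last block simply runs to the end of the text.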

Try it in regexr

I used the answer https://superuser.com/questions/1279062/regex-matching-line-not-containing-the-string#1279115 to "match all lines that don’t contain a string"

Answered By: raneq
