regex: Find all blocks of consecutive lines, where the blocks are separated by a pattern
Question:
I have a badly parsed text where multiple text blocks are separated by lines with only three digits. What I want is a regex that captures all the text in a block, starting with and including the three-digit line and running until the last whitespace before the next three-digit line.
This is the one I’ve tried, but because it uses a lookahead, the last group is not captured.
\n*((\d{3})\n*([\S\s]+?)(?=\s\d{3}\s))
Sample:
foo
000
foo bar
foo
461
long
multiline
text
999
last example
until rest of document
Expected groups:
[000
foo bar
foo
] Group 1
[461
long
multiline
text
] Group 2
[999
last example
until rest of document] Group 3
Answers:
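Does this solve your problem? You need to add "$" as an alternative inside the lookahead so that the last group is matched too. Without the multiline flag, "$" matches at the end of the text (or just before a final trailing newline).
import re

# The sample text from the question
text = "foo\n000\nfoo bar\nfoo\n461\nlong\nmultiline\ntext\n999\nlast example\nuntil rest of document"
# Lazily match from three digits up to the next three digits or the end of the text
pattern = r'(\d{3}(.|\n|\r)*?)(?=\d{3}|$)'
for match in re.finditer(pattern, text):
    print(match.group())
    print('=' * 50)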
Output:
000
foo bar
foo
==================================================
461
long
multiline
text
==================================================
999
last example
until rest of document
==================================================
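To see why the lookahead in the question drops the last block, and what adding "$" changes, here is a minimal sketch. It assumes the sample text has no trailing newline; the names SAMPLE, ATTEMPT and WITH_END are made up for this illustration and are not from the original posts.

```python
import re

# Sample text from the question (assumed to have no trailing newline).
SAMPLE = (
    "foo\n000\nfoo bar\nfoo\n461\nlong\nmultiline\ntext\n"
    "999\nlast example\nuntil rest of document"
)

# Pattern from the question: the lookahead demands another three-digit line.
ATTEMPT = r"\n*((\d{3})\n*([\S\s]+?)(?=\s\d{3}\s))"
# Same pattern with "$" added as an alternative inside the lookahead.
WITH_END = r"\n*((\d{3})\n*([\S\s]+?)(?=\s\d{3}\s|$))"

# Group 1 holds the block: the three-digit line plus the text after it.
attempt_blocks = [m.group(1) for m in re.finditer(ATTEMPT, SAMPLE)]
with_end_blocks = [m.group(1) for m in re.finditer(WITH_END, SAMPLE)]

print(len(attempt_blocks))        # blocks found by the original attempt
print(len(with_end_blocks))       # blocks found once "$" is allowed
print(repr(with_end_blocks[-1]))  # the block that was previously missed
```

The last block has no three-digit line after it, so a lookahead that only accepts "\s\d{3}\s" can never succeed there. Allowing "$" gives the lazy quantifier a second place to stop.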
You can use a negative lookahead to match all lines that do not contain the starting group token. This way the end of the file is not a problem. The pattern needs the multiline flag so that ^ and $ match at line boundaries.
(^\d{3}$(?:(?!^\d{3}$)[\s\S])+)
Analysing it:
(^\d{3}$(?:(?!^\d{3}$)[\s\S])+)
Our only group. Every match will contain one.
^\d{3}$
The token that marks the start of a group: three digits alone on a line.
(?:(?!^\d{3}$)[\s\S])+
The rest of the group. Match all consecutive characters that satisfy the rule, without capturing them one by one: (?:xxx) is a non-capturing group.
(?!^\d{3}$)[\s\S]
Match any character, line breaks included ([\s\S]), as long as the group start token does not begin at that position.
I based this on the answer at https://superuser.com/questions/1279062/regex-matching-line-not-containing-the-string#1279115, which shows how to "match all lines that don’t contain a string".
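The breakdown above can be tried out with a short Python sketch. It assumes Python's re module, Unix line endings, and the sample text from the question. The outer capturing group is dropped because the whole match is already the block, and the names BLOCK, text and blocks are chosen only for this illustration.

```python
import re

# Same pattern, written with re.VERBOSE so each part carries its explanation.
# re.MULTILINE is needed so that ^ and $ match at the start and end of lines.
BLOCK = re.compile(
    r"""
    ^\d{3}$            # start marker: a line holding exactly three digits
    (?:                # non-capturing group, repeated for every character
        (?!^\d{3}$)    # stop if the next start marker begins here
        [\s\S]         # otherwise consume any character, newlines included
    )+
    """,
    re.MULTILINE | re.VERBOSE,
)

# The sample text from the question.
text = (
    "foo\n000\nfoo bar\nfoo\n461\nlong\nmultiline\ntext\n"
    "999\nlast example\nuntil rest of document"
)

# With no capturing groups, findall returns each whole match as a string.
blocks = BLOCK.findall(text)
for block in blocks:
    print(repr(block))
```

Because of the trailing +, a three-digit line with nothing at all after it would not be matched. If that case matters, changing + to * is one possible adjustment.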