capture multi-line groups

Question:

I would like to extract each NAME_ group’s information using regex (Python3).
For example, I have a text like

AB_ NAME_ 111 "fruit";
AB_ EX_ 111 first_fruit "banana";
AB_ EX_ 111 second_fruit_info "Do you like 
apple

or grape?";
AB_ EX_ 111 third_fruit "tomato";
AB_ NAME_ 120 "food";
AB_ NAME_ 130 "clothes";
AB_ EX_ 130 first_clothes "t-shirt"; 

and the result I want to get is three groups:
1)

AB_ NAME_ 111 "fruit";
AB_ EX_ 111 first_fruit "banana";
AB_ EX_ 111 second_fruit_info "Do you like 
apple

or grape?";
AB_ EX_ 111 third_fruit "tomato";

2)

AB_ NAME_ 120 "food";

3)

AB_ NAME_ 130 "clothes";
AB_ EX_ 130 first_clothes "t-shirt";

which are split by their ID (the number after NAME_).

I tried to match an AB_ NAME_ line followed by zero or more AB_ EX_ entries, as below, but it failed. I also tried the re.S and re.M flags, but they didn't help.

AB_ NAME_ \d+ .+;\n(AB_ EX_ \d+ (.|\n)+;\n)*
Asked By: Stella


Answers:

Use re.DOTALL so that . also matches newline characters, then call re.findall() to collect every match, like this:

import re

text = """AB_ NAME_ 111 "fruit";
AB_ EX_ 111 first_fruit "banana";
AB_ EX_ 111 second_fruit_info "Do you like
apple

or grape?";
AB_ EX_ 111 third_fruit "tomato";
AB_ NAME_ 120 "food";
AB_ NAME_ 130 "clothes";
AB_ EX_ 130 first_clothes "t-shirt";"""

regex = r"AB_ NAME_.*?(?=AB_ NAME_|$)"

print(re.findall(regex, text, re.DOTALL))

The regex pattern is this: AB_ NAME_.*?(?=AB_ NAME_|$)

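The lazy .*? consumes as little as possible, and the lookahead (?=AB_ NAME_|$) stops each match just before the next AB_ NAME_ or at $, without consuming it. Because re.MULTILINE is not set, $ here matches at the end of the entire string rather than at the end of each line.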
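If the groups also need to be looked up by their ID, the same lookahead idea can be extended slightly. The following is only a sketch, not part of the original answer: it assumes every record starts at the beginning of a line, anchors AB_ NAME_ with ^ (via re.MULTILINE) so that the same text inside a quoted value would not start a new group, and uses \Z for the end of the string. The names pattern and groups are illustrative.

```python
import re

text = """AB_ NAME_ 111 "fruit";
AB_ EX_ 111 first_fruit "banana";
AB_ EX_ 111 second_fruit_info "Do you like
apple

or grape?";
AB_ EX_ 111 third_fruit "tomato";
AB_ NAME_ 120 "food";
AB_ NAME_ 130 "clothes";
AB_ EX_ 130 first_clothes "t-shirt";"""

# re.DOTALL: "." also matches newlines, so a group can span several lines.
# re.MULTILINE: "^" matches at the start of every line, not only the string.
# \Z matches only at the very end of the string (unlike "$" under MULTILINE).
pattern = re.compile(
    r"^AB_ NAME_ (\d+) .*?(?=^AB_ NAME_ |\Z)",
    re.DOTALL | re.MULTILINE,
)

# Map each NAME_ ID (captured by group 1) to the full text of its group.
groups = {m.group(1): m.group(0).strip() for m in pattern.finditer(text)}

for name_id, block in groups.items():
    print(f"--- {name_id} ---")
    print(block)
```

Each dictionary value is one complete group, with the trailing newline stripped, so groups["120"] holds just the single NAME_ line.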

Answered By: SnoopFrog