How do you find multiple matches of a string between two different tokens with python regex?

Question:

I’m having trouble writing a regex that Python supports to handle this use case.

Imagine you have a text string that is a set of questions and multiple choice answers:

Question 1: What witch-like attributes do you have?
Answer 1:
x Hat
o Pointy Nose
x Float
x Weigh more than a duck

Question 2: Where could this coconut have come from?
Answer 2:
o It migrated
x A European swallow carried it
o An African swallow carried it
x It doesn't matter

… and you would like to parse the above text for only the "x" answers to Question 1 using Regex.

If you had access to PCRE you could do something like this using the \G (last match) anchor:

(?:\G(?!^)|Question 1:)(?:(?!Question 1:|Question 2:)[\s\S])*?\K(?:x\s)([a-z]+)(?=(?:(?!Question 1:)[\s\S])*Question 2:)

…or maybe even something fun using subroutines (e.g., (textbetweentokens)(?1)(textwithx)).

But Python doesn’t support either of those regex features.

Is there any other way to solve this regex challenge?

Note: There are other questions like this on Stack Overflow, but none that I could find had answers that were usable with Python-supported regex.

Asked By: Adam Brand

||

Answers:

You can split your text into lines and test each line with str.startswith(). Note that this prints the "x" lines for every question, not only Question 1.

texte = """Question 1: What witch-like attributes do you have?
Answer 1:
x Hat
o Pointy Nose
x Float
x Weigh more than a duck

Question 2: Where could this coconut have come from?
Answer 2:
o It migrated
x A European swallow carried it
o An African swallow carried it
x It doesn't matter"""

# Print every line that starts with "x".
for line in texte.splitlines():
    if line.startswith('x'):
        print(line)

Output:

x Hat
x Float
x Weigh more than a duck
x A European swallow carried it
x It doesn't matter
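
To keep only the answers to one question with the same line-by-line approach, a small state flag can be added to the loop. This is a minimal sketch, assuming every question starts on a line beginning with "Question "; the helper name checked_answers is made up for illustration.

```python
text = """Question 1: What witch-like attributes do you have?
Answer 1:
x Hat
o Pointy Nose
x Float
x Weigh more than a duck

Question 2: Where could this coconut have come from?
Answer 2:
o It migrated
x A European swallow carried it
o An African swallow carried it
x It doesn't matter"""


def checked_answers(text, question_label):
    """Collect the "x" lines that follow question_label, up to the next question."""
    selected = []
    inside = False
    for line in text.splitlines():
        if line.startswith("Question "):
            # A new question starts here: remember whether it is the wanted one.
            inside = line.startswith(question_label)
        elif inside and line.startswith("x "):
            # Drop the leading "x " marker and keep the answer text.
            selected.append(line[2:])
    return selected


q1_answers = checked_answers(text, "Question 1:")
print(q1_answers)
```

Calling checked_answers(text, "Question 2:") would return the checked answers of the second question instead.
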
Answered By: Tourelou

You could match each line that starts with "x" but include a look-ahead assertion that checks that the next question is question 2:

^x\s(.*)(?=\s+(?:^(?!Question).*\s+)*^Question 2)

Use the re.M flag so that ^ matches at the start of each line.

This assumes of course that the question that precedes question 2 is question 1.

import re

s = """Question 1: What witch-like attributes do you have?
Answer 1:
x Hat
o Pointy Nose
x Float
x Weigh more than a duck

Question 2: Where could this coconut have come from?
Answer 2:
o It migrated
x A European swallow carried it
o An African swallow carried it
x It doesn't matter
"""

answers = re.findall(r"^x\s(.*)(?=\s+(?:^(?!Question).*\s+)*^Question 2)", s, re.M)
print(answers)

Output:

['Hat', 'Float', 'Weigh more than a duck']
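
If relying on the order of the questions feels fragile, another option is to work in two steps: first cut out the text between the two tokens, then search only inside that slice. This is a sketch rather than part of the answer above, and it assumes the literal markers "Question 1:" and "Question 2:" both occur in the text.

```python
import re

text = """Question 1: What witch-like attributes do you have?
Answer 1:
x Hat
o Pointy Nose
x Float
x Weigh more than a duck

Question 2: Where could this coconut have come from?
Answer 2:
o It migrated
x A European swallow carried it
o An African swallow carried it
x It doesn't matter
"""

# Step 1: isolate everything between the two tokens.
# re.S lets "." cross line breaks; ".*?" stops at the first "Question 2:".
block = re.search(r"Question 1:(.*?)Question 2:", text, re.S)

# Step 2: pick out the "x" lines inside that slice only.
answers = re.findall(r"^x\s(.*)", block.group(1), re.M) if block else []
print(answers)
```

The second pattern no longer needs a look-ahead, because the slice already ends before Question 2.
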

Explanations

The (?!Question) is a negative look-ahead. It ensures that no other question line sits between the matched answer and Question 2. For instance, if we actually wanted the answers to question 4, we would look ahead for "Question 5", but we must not pick up the answers to the first three questions. The negative look-ahead assertion makes sure that doesn’t happen.
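
To illustrate the "question 4, look for Question 5" idea, the pattern can be built for any question number. This is a minimal sketch: the function name answers_to and the three-question sample are invented for the example, and the added \b (so that "Question 1" does not also match "Question 10") is an assumption not present in the original pattern.

```python
import re


def answers_to(text, n):
    """Return the "x" answers of question n by looking ahead to question n + 1."""
    pattern = rf"^x\s(.*)(?=\s+(?:^(?!Question).*\s+)*^Question {n + 1}\b)"
    return re.findall(pattern, text, re.M)


# Hypothetical sample data with three questions.
sample = """Question 1: First?
x A
o B

Question 2: Second?
o C
x D

Question 3: Third?
x E
"""

first = answers_to(sample, 1)
second = answers_to(sample, 2)
last = answers_to(sample, 3)
print(first, second, last)
```

Because the look-ahead needs a following question, this sketch returns an empty list for the last question in the text; that case would need separate handling.
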

The (.*) is the capture group whose content is returned in the findall results. If you want an answer to have at least one character, you could change that to (.+), but I guess you either don’t have empty answers in your input or would like to know about them, which is why I chose (.*).
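
The difference between the two quantifiers can be seen on a tiny input. This is a sketch with made-up data in which the second checked answer is empty ("x " followed directly by a line break); the look-ahead is left out to keep the comparison short.

```python
import re

# Hypothetical input: the middle "x" line has no answer text.
sample = "x Hat\nx \nx Float\n"

with_star = re.findall(r"^x\s(.*)", sample, re.M)  # keeps the empty answer
with_plus = re.findall(r"^x\s(.+)", sample, re.M)  # skips the empty answer
print(with_star)
print(with_plus)
```

With (.*) the empty answer shows up as an empty string in the result, so it can be detected; with (.+) that line is silently dropped.
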

Answered By: trincot