Extract string between HTML tags with RegEx where the opening tag has an attribute

Question

This is about Python’s re module. Given the example below, I would like to extract everything between the opening and closing main tags.

<main attr="value">
<foo>bar</foo>
</main>

The expected output

<foo>bar</foo>

My problem while building the regex pattern is the attribute in the opening tag:

<main attr="value">
     ^^^^^^^^^^^^^

I’m not sure how to express this with regex. Without the attribute (<main>) this regex does work:

(?s)<main>(.+?)</main>

I assume it has something to do with .* but I couldn’t get it to work. How can I ignore the string between <main and > in the first line?

This question is not about HTML but about a specific regex problem. The HTML part is just for illustration. I use this in unit tests. I’m aware that this isn’t robust enough for production use. But I’m the producer of that HTML, so I know what is coming. I won’t bloat my code or dependencies just to parse HTML. I have good reasons to do it that way.

Asked By: buhtz

Answer 1

The following code extracts everything between the opening and closing main tags, whether or not the opening tag carries attributes, and it matches only the main tag.

Using the re module:

import re

text = "<main attr='value'><foo>bar</foo></main>"

# (?: [^>]+)? optionally matches a space plus any attributes up to the closing ">"
match = re.search(r"(?s)<main(?: [^>]+)?>(.+?)</main>", text)

if match:
    print(match.group(1))
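
To see how the pattern behaves on the multi-line input from the question, here is a small sketch. The helper name extract_main, the samples dictionary, and the .strip() call are assumptions added for illustration; they are not part of the original answer. The pattern itself is unchanged: (?s) lets . match newlines, and (?: [^>]+)? optionally consumes a space followed by everything up to the closing >.

```python
import re

# Same pattern as above, compiled once for reuse.
PATTERN = re.compile(r"(?s)<main(?: [^>]+)?>(.+?)</main>")

# Hypothetical sample inputs (illustration only).
samples = {
    "with attribute": '<main attr="value">\n<foo>bar</foo>\n</main>',
    "without attribute": "<main>\n<foo>bar</foo>\n</main>",
    "different tag": "<mainframe>\n<foo>bar</foo>\n</mainframe>",
}


def extract_main(text):
    """Return the content of the first main element, or None if there is none."""
    match = PATTERN.search(text)
    # strip() drops the newlines that surround the inner content.
    return match.group(1).strip() if match else None


for label, text in samples.items():
    print(label, "->", extract_main(text))
```

The first two samples should yield <foo>bar</foo>, and the third should yield None, because after <main the pattern requires either a space or > and therefore rejects <mainframe>. Note that an attribute value containing a literal > would end the [^>]+ part too early, which is acceptable only when you control the HTML, as the question states.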

Answered By: executable
