Split text on markup in Python

Question:

I have the following line of text:

<code>stuff</code> and stuff and $LaTeX$ and <pre class="mermaid">stuff</pre>

Using Python, I want to split it on the markup tags to get the following list:

['<code>', 'stuff', '</code>', ' and stuff and $\LaTeX$ ', '<pre class="mermaid">', 'stuff', '</pre>']

So far, I have used:

import re

markup = re.compile(r"(<(?P<tag>[a-z]+).*>)(.*?)(</(?P=tag)>)")
text = '<code>stuff</code> and stuff and $LaTeX$ and <pre class="mermaid">stuff</pre>'
words = re.split(markup, text)

but it yields:

['<code>', 'code', 'stuff', '</code>', ' and stuff and $\LaTeX$ ', '<pre class="mermaid">', 'pre', 'stuff', '</pre>']

The problem is that the named group (?P<tag>...) ends up in the list because it is a capturing group. I only capture it so that the backreference (?P=tag) matches the corresponding closing tag.

How can I get rid of it in the resulting list, assuming the code only ever processes a single line at a time?

Asked By: Aurélien Pierre


Answers:

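s = r'<code>stuff</code> and stuff and $LaTeX$ and <pre class="mermaid">stuff</pre>'

# Start with one empty token so that text before the first tag has a place to go.
tokens = [""]

for ch in s:
    if ch == "<":
        # A tag starts: open a new token for it.
        tokens.append(ch)
    elif ch == ">":
        # A tag ends: close it, then open a new token for whatever follows.
        tokens[-1] += ch
        tokens.append("")
    else:
        tokens[-1] += ch

# Drop the empty tokens left before, between, and after tags.
tokens = [t for t in tokens if t]
print(tokens)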

Output: ['<code>', 'stuff', '</code>', ' and stuff and $\LaTeX$ and ', '<pre class="mermaid">', 'stuff', '</pre>']

Answered By: Shub

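You could use xml.etree.ElementTree from the standard library. It is an XML parser, not an HTML parser, so this only works when the line is well-formed XML (every tag closed, attribute values quoted), which is the case for your example.

import xml.etree.ElementTree as ET

text = '<code>stuff</code> and stuff and $LaTeX$ and <pre class="mermaid">stuff</pre>'

# Wrap the fragment in a dummy root so that it parses as a single XML document.
root = ET.fromstring(f'<root>{text}</root>')

result = []
if root.text:
    result.append(root.text)  # text before the first tag

# Only direct children of the dummy root are visited; nested tags are not expanded.
for element in root:
    # Rebuild the opening tag, including its attributes.
    attrs = ''.join(f' {key}="{value}"' for key, value in element.attrib.items())
    result.append(f'<{element.tag}{attrs}>')
    if element.text:
        result.append(element.text)
    result.append(f'</{element.tag}>')
    if element.tail:
        result.append(element.tail)  # text between this element and the next

print(result)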
Answered By: ApaxPhoenix

RegEx is not suitable for parsing HTML. However, it typically suffices for tokenization. Using re.finditer, tokenization becomes a one-liner:

list(map(lambda x: x.group(0), re.finditer(r"(?:<(?:.*?>)?)|[^<]+", s)))

Explanation:

  • Use only noncapturing groups (?:...); we don’t need specific captures here.
  • Match either a "tag" <(?:.*?>)? or plain text [^<]+. A "tag" is recognized only by its opening < and runs up to the next >; it may be invalid (for example, a lone < sign with no closing >).

This handles your test case

s = '<code>stuff</code> and stuff and $LaTeX$ and <pre class="mermaid">stuff</pre>'

correctly, producing

['<code>', 'stuff', '</code>', ' and stuff and $\LaTeX$ and ', '<pre class="mermaid">', 'stuff', '</pre>']
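For reference, a self-contained version of the tokenizer above might look like the following sketch. It uses re.findall rather than re.finditer, which should be equivalent here because the pattern has no capturing groups, so findall returns the full matches. The names TOKEN and tokenize are made up for this illustration.

```python
import re

# Hypothetical names (TOKEN, tokenize); they do not appear in the answer above.
# A token is either a "tag" (a "<" plus everything up to the next ">",
# or a lone "<" if no ">" follows) or a run of plain text without "<".
TOKEN = re.compile(r"(?:<(?:.*?>)?)|[^<]+")


def tokenize(line):
    """Split a single line into a list of tags and text runs."""
    # With no capturing groups in the pattern, findall returns whole matches.
    return TOKEN.findall(line)


s = '<code>stuff</code> and stuff and $LaTeX$ and <pre class="mermaid">stuff</pre>'
print(tokenize(s))
```

Since every character of the line ends up in exactly one token, joining the tokens gives back the original line, which is a cheap sanity check.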

Note however that a full-blown HTML tokenizer would need a much more complex regular grammar to handle e.g. attributes like onclick = "console.log(1 < 2)" properly. You’d be better off using an off-the-shelf library to do the markup parsing (or even just tokenization) for you.
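As one possible way to follow that advice without installing anything, the standard library's html.parser can do the tokenization. The sketch below is only an illustration: the class name TagSplitter and the helper split_markup are hypothetical, and it assumes lowercase tag names, because the parser reports end tags in lowercase and the closing tag is rebuilt from that name.

```python
from html.parser import HTMLParser


class TagSplitter(HTMLParser):
    """Hypothetical helper: collects tags and text runs in document order."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the raw text of the start tag
        # that was just parsed, attributes included.
        self.tokens.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # The parser only reports the (lowercased) tag name,
        # so the closing tag is rebuilt from it.
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        self.tokens.append(data)


def split_markup(line):
    parser = TagSplitter()
    parser.feed(line)
    parser.close()  # flush any text still buffered by the parser
    return parser.tokens


s = '<code>stuff</code> and stuff and $LaTeX$ and <pre class="mermaid">stuff</pre>'
print(split_markup(s))
```

Note that, with the default settings, the parser decodes character references such as &amp;amp; in text, so the tokens are not guaranteed to reproduce the input byte for byte.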

Answered By: Luatic