Split text on markup in Python
Question:
I have the following line of text:
<code>stuff</code> and stuff and $LaTeX$ and <pre class="mermaid">stuff</pre>
Using Python, I want to break the markup entities to get the following list:
['<code>', 'stuff', '</code>', ' and stuff and $\LaTeX$ ', '<pre class="mermaid">', 'stuff', '</pre>']
So far, I used:
markup = re.compile(r"(<(?P<tag>[a-z]+).*>)(.*?)(</(?P=tag)>)")
text = '<code>stuff</code> and stuff and $LaTeX$ and <pre class="mermaid">stuff</pre>'
words = re.split(markup, text)
but it yields :
['<code>', 'code', 'stuff', '</code>', ' and stuff and $\LaTeX$ ', '<pre class="mermaid">', 'pre', 'stuff', '</pre>']
The problem is that the (?P<tag>[a-z]+) group is captured, so re.split includes it in the resulting list; I only capture it so that the backreference (?P=tag) can match the corresponding closing tag.
How could I get rid of it in the resulting list, assuming the code processes only a single line at a time?
Answers:
s = r'<code>stuff</code> and stuff and $LaTeX$ and <pre class="mermaid">stuff</pre>'
l = []  # note: this scanner assumes the line begins with '<'
for ch in s:
    if ch == ">":
        l[-1] += ch   # close the current token...
        l.append("")  # ...and start a new one
    elif ch == "<":
        l.append("")  # start a new token with this '<'
        l[-1] += ch
    else:
        l[-1] += ch   # accumulate into the current token
l.pop()  # drop the trailing empty token
print(l)
Output: ['<code>', 'stuff', '</code>', ' and stuff and $\LaTeX$ and ', '<pre class="mermaid">', 'stuff', '</pre>']
You could use the standard library's xml module, which is designed for XML files; the HTML in your example is simple enough to be parsed as XML.
import xml.etree.ElementTree as ET

text = '<code>stuff</code> and stuff and $LaTeX$ and <pre class="mermaid">stuff</pre>'
root = ET.fromstring(f'<root>{text}</root>')
result = []
if root.text:  # text before the first tag, if any
    result.append(root.text)
for element in root:
    attrs = ''.join(f' {k}="{v}"' for k, v in element.attrib.items())
    result.append(f'<{element.tag}{attrs}>')  # opening tag, attributes included
    if element.text:
        result.append(element.text)
    result.append(f'</{element.tag}>')        # closing tag
    if element.tail:                          # text between this tag and the next
        result.append(element.tail)
print(result)
RegEx is not suitable for parsing HTML. However, it typically suffices for tokenization. Using re.finditer, tokenization becomes a one-liner:
list(map(lambda x: x.group(0), re.finditer(r"(?:<(?:.*?>)?)|[^<]+", s)))
Explanation:
- Use only non-capturing groups (?:...); we don't need specific captures here.
- Match either a "tag" <(?:.*?>)? (which may be "invalid", i.e. just a lone < sign; it is recognized by its opening < and runs until the next >), or plain text [^<]+.
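A quick check of the lone-< case mentioned in the second bullet (the sample string is my own):

```python
import re

pattern = re.compile(r"(?:<(?:.*?>)?)|[^<]+")
# A '<' with no matching '>' still becomes its own token,
# because the (?:.*?>)? part is optional:
print([m.group(0) for m in pattern.finditer("a < b")])
# → ['a ', '<', ' b']
```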
This handles your test case
s = '<code>stuff</code> and stuff and $LaTeX$ and <pre class="mermaid">stuff</pre>'
correctly, producing
['<code>', 'stuff', '</code>', ' and stuff and $\LaTeX$ and ', '<pre class="mermaid">', 'stuff', '</pre>']
Note however that a full-blown HTML tokenizer would need a much more complex regular grammar to handle e.g. attributes like onclick = "console.log(1 < 2)"
properly. You’d be better off using an off-the-shelf library to do the markup parsing (or even just tokenization) for you.
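For instance, a minimal tokenizer sketch on top of the standard library's html.parser (the class name and wiring are my own; get_starttag_text() preserves the tag exactly as written, attributes included):

```python
from html.parser import HTMLParser

class Tokenizer(HTMLParser):
    """Collects start tags (with attributes), end tags and text runs as tokens."""
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag verbatim, attributes included
        self.tokens.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.tokens.append(f'</{tag}>')

    def handle_data(self, data):
        self.tokens.append(data)

p = Tokenizer()
p.feed('<code>stuff</code> and stuff and $LaTeX$ and <pre class="mermaid">stuff</pre>')
p.close()  # flush any buffered trailing text
print(p.tokens)
```

Unlike the regex one-liner, this also handles attribute values containing < or > correctly, since the parser understands quoting.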