Fixing HTML Tags within XML files using Python

Question:

I have been given .htm files that are structured as XML and contain HTML tags. The problem is that many of the HTML entities in them have been decoded along the way. For example, &lt; has been converted to <, &amp; has been converted to &, and so on. Is there a Python module that can restore these HTML entities, something like an HTML corrector?

For example:

<Employee>
  <name> Adam</name
  <age> > 24 </age>
  <Nicknames> A & B </Nicknames>
</Employee>

In the example above, the > in age should be converted to &gt; and the & should be converted to &amp;.

Desired Result:

<Employee>
  <name> Adam</name
  <age> &gt; 24 </age>
  <Nicknames> A &amp; B </Nicknames>
</Employee>
Asked By: smilelife

Answers:

If the HTML is well-formed, you can just convert to a BeautifulSoup object (from beautifulsoup4) and the inner text of each tag will be escaped:

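from bs4 import BeautifulSoup

my_html = """<Employee>
<name> Adam</name>
<age> > 24 </age>
<Nicknames> A & B </Nicknames>
</Employee>"""

# Name the parser explicitly; html.parser ships with Python and lowercases tag names
soup = BeautifulSoup(my_html, "html.parser")
print(soup)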

Outputs:

<employee>
<name> Adam</name>
<age> &gt; 24 </age>
<nicknames> A &amp; B </nicknames>
</employee>
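
To make it clearer what happens to the text inside each tag, here is a minimal sketch using only the standard library. It assumes that the escaping BeautifulSoup applies on output is roughly what html.escape does with quote=False, namely replacing &, < and > in a plain string. The variable names are only for illustration.

```python
import html

# quote=False leaves quotation marks alone, so only &, < and > are replaced.
age_text = html.escape(" > 24 ", quote=False)
nick_text = html.escape(" A & B ", quote=False)

print(age_text)
print(nick_text)

# Caveat: html.escape does not recognise existing entities, so text that is
# already escaped gets escaped a second time.
twice = html.escape("A &amp; B", quote=False)
print(twice)
```

This only works on text you have already separated from the markup. Running it over the whole file would escape the tags themselves, which is why parsing first is the safer route.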

Not sure if this was intentional, but the exact example you provided includes a broken tag: </name without the closing >. You'd need to fix this first, which is trickier; a regular expression might do it. This gets the correct output for your example:

import re
from bs4 import BeautifulSoup

my_html = """<Employee>
<name> Adam</name
<age> > 24 </age>
<Nicknames> A & B </Nicknames>
</Employee>"""

# Insert the missing ">" when a closing tag name is followed directly by whitespace
my_html = re.sub(r"</([^>]*)(\s)", r"</\1>\2", my_html)
soup = BeautifulSoup(my_html, "html.parser")
print(soup)
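
If you need to run the repair over many files, a slightly stricter pattern may be safer. The following is a sketch, not a general solution: close_end_tags is a hypothetical helper name, and it assumes a broken closing tag is always followed by optional whitespace and then either another tag or the end of the input.

```python
import re

# A closing tag name that is followed (after optional whitespace) by another
# tag or by the end of the input, which means its ">" is missing.
UNCLOSED_END_TAG = re.compile(r"</([A-Za-z][\w.-]*)(?=\s*(?:<|$))")


def close_end_tags(markup):
    """Append the missing ">" to closing tags that were left open."""
    return UNCLOSED_END_TAG.sub(r"</\1>", markup)


broken = "<name> Adam</name\n<age> > 24 </age>"
repaired = close_end_tags(broken)
print(repaired)
```

Unlike the pattern above, this leaves a valid closing tag with a space before the > (such as </name >) untouched. It will not repair a broken closing tag that is followed directly by text rather than by another tag.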
Answered By: ljdyer