Convert HTML entities to Unicode and vice versa
Question:
How do you convert HTML entities to Unicode and vice versa in Python?
Answers:
You need to have BeautifulSoup.
from BeautifulSoup import BeautifulStoneSoup
import cgi
def HTMLEntitiesToUnicode(text):
"""Converts HTML entities to unicode. For example '&' becomes '&'."""
text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
return text
def unicodeToHTMLEntities(text):
"""Converts unicode to HTML entities. For example '&' becomes '&'."""
text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
return text
text = "&, ®, <, >, ¢, £, ¥, €, §, ©"
uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)
print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &, ®, <, >, ¢, £, ¥, €, §, ©
As to the “vice versa” (which I needed myself, leading me to find this question, which didn’t help, and subsequently another site which had the answer):
u'some string'.encode('ascii', 'xmlcharrefreplace')
will return a plain string with any non-ascii characters turned into XML (HTML) entities.
As hekevintran answer suggests, you may use cgi.escape(s)
for encoding stings, but notice that encoding of quote is false by default in that function and it may be a good idea to pass the quote=True
keyword argument alongside your string. But even by passing quote=True
, the function won’t escape single quotes ("'"
) (Because of these issues the function has been deprecated since version 3.2)
It’s been suggested to use html.escape(s)
instead of cgi.escape(s)
. (New in version 3.2)
Also html.unescape(s)
has been introduced in version 3.4.
So in python 3.4 you can:
- Use
html.escape(text).encode('ascii', 'xmlcharrefreplace').decode()
to convert special characters to HTML entities.
- And
html.unescape(text)
for converting HTML entities back to plain-text representations.
Update for Python 2.7 and BeautifulSoup4
Unescape — Unicode HTML to unicode with htmlparser
(Python 2.7 standard lib):
>>> escaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
>>> from HTMLParser import HTMLParser
>>> htmlparser = HTMLParser()
>>> unescaped = htmlparser.unescape(escaped)
>>> unescaped
u'Monsieur le Curxe9 of the xabNotre-Dame-de-Grxe2cexbb neighborhood'
>>> print unescaped
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood
Unescape — Unicode HTML to unicode with bs4
(BeautifulSoup4):
>>> html = '''<p>Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood</p>'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> soup.text
u'Monsieur le Curxe9 of the xabNotre-Dame-de-Grxe2cexbb neighborhood'
>>> print soup.text
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood
Escape — Unicode to unicode HTML with bs4
(BeautifulSoup4):
>>> unescaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
>>> from bs4.dammit import EntitySubstitution
>>> escaper = EntitySubstitution()
>>> escaped = escaper.substitute_html(unescaped)
>>> escaped
u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
I used the following function to convert unicode ripped from an xls file into a an html file while conserving the special characters found in the xls file:
def html_wr(f, dat):
''' write dat to file f as html
. file is assumed to be opened in binary format
. if dat is nul it is replaced with non breakable space
. non-ascii characters are translated to xml
'''
if not dat:
dat = ' '
try:
f.write(dat.encode('ascii'))
except:
f.write(html.escape(dat).encode('ascii', 'xmlcharrefreplace'))
hope this is useful to somebody
If someone like me is out there wondering why some entity numbers (codes) like ™ (for trademark symbol), € (for euro symbol)
are not encoded properly, the reason is in ISO-8859-1 (aka Windows-1252) those characters are not defined.
Also note that, the default character set as of html5 is utf-8 it was ISO-8859-1 for html4
So, we will have to workaround somehow (find & replace those at first)
Reference (starting point) from Mozilla’s documentation
https://developer.mozilla.org/en-US/docs/Web/Guide/Localizations_and_character_encodings
$ python3 -c "
> import html
> print(
> html.unescape('&©—')
> )"
&©—
$ python3 -c "
> import html
> print(
> html.escape('&©—')
> )"
&©—
$ python2 -c "
> from HTMLParser import HTMLParser
> print(
> HTMLParser().unescape('&©—')
> )"
&©—
$ python2 -c "
> import cgi
> print(
> cgi.escape('&©—')
> )"
&©—
HTML only strictly requires &
(ampersand) and <
(left angle bracket / less-than sign) to be escaped. https://html.spec.whatwg.org/multipage/parsing.html#data-state
#!/usr/bin/env python3
import fileinput
import html
for line in fileinput.input():
print(html.unescape(line.rstrip('n')))
For python3
use html.unescape()
:
import html
s = "&"
decoded = html.unescape(s)
# &
How do you convert HTML entities to Unicode and vice versa in Python?
You need to have BeautifulSoup.
from BeautifulSoup import BeautifulStoneSoup
import cgi
def HTMLEntitiesToUnicode(text):
"""Converts HTML entities to unicode. For example '&' becomes '&'."""
text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
return text
def unicodeToHTMLEntities(text):
"""Converts unicode to HTML entities. For example '&' becomes '&'."""
text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
return text
text = "&, ®, <, >, ¢, £, ¥, €, §, ©"
uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)
print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &, ®, <, >, ¢, £, ¥, €, §, ©
As to the “vice versa” (which I needed myself, leading me to find this question, which didn’t help, and subsequently another site which had the answer):
u'some string'.encode('ascii', 'xmlcharrefreplace')
will return a plain string with any non-ascii characters turned into XML (HTML) entities.
As hekevintran answer suggests, you may use cgi.escape(s)
for encoding stings, but notice that encoding of quote is false by default in that function and it may be a good idea to pass the quote=True
keyword argument alongside your string. But even by passing quote=True
, the function won’t escape single quotes ("'"
) (Because of these issues the function has been deprecated since version 3.2)
It’s been suggested to use html.escape(s)
instead of cgi.escape(s)
. (New in version 3.2)
Also html.unescape(s)
has been introduced in version 3.4.
So in python 3.4 you can:
- Use
html.escape(text).encode('ascii', 'xmlcharrefreplace').decode()
to convert special characters to HTML entities. - And
html.unescape(text)
for converting HTML entities back to plain-text representations.
Update for Python 2.7 and BeautifulSoup4
Unescape — Unicode HTML to unicode with htmlparser
(Python 2.7 standard lib):
>>> escaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
>>> from HTMLParser import HTMLParser
>>> htmlparser = HTMLParser()
>>> unescaped = htmlparser.unescape(escaped)
>>> unescaped
u'Monsieur le Curxe9 of the xabNotre-Dame-de-Grxe2cexbb neighborhood'
>>> print unescaped
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood
Unescape — Unicode HTML to unicode with bs4
(BeautifulSoup4):
>>> html = '''<p>Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood</p>'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> soup.text
u'Monsieur le Curxe9 of the xabNotre-Dame-de-Grxe2cexbb neighborhood'
>>> print soup.text
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood
Escape — Unicode to unicode HTML with bs4
(BeautifulSoup4):
>>> unescaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
>>> from bs4.dammit import EntitySubstitution
>>> escaper = EntitySubstitution()
>>> escaped = escaper.substitute_html(unescaped)
>>> escaped
u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
I used the following function to convert unicode ripped from an xls file into a an html file while conserving the special characters found in the xls file:
def html_wr(f, dat):
''' write dat to file f as html
. file is assumed to be opened in binary format
. if dat is nul it is replaced with non breakable space
. non-ascii characters are translated to xml
'''
if not dat:
dat = ' '
try:
f.write(dat.encode('ascii'))
except:
f.write(html.escape(dat).encode('ascii', 'xmlcharrefreplace'))
hope this is useful to somebody
If someone like me is out there wondering why some entity numbers (codes) like ™ (for trademark symbol), € (for euro symbol)
are not encoded properly, the reason is in ISO-8859-1 (aka Windows-1252) those characters are not defined.
Also note that, the default character set as of html5 is utf-8 it was ISO-8859-1 for html4
So, we will have to workaround somehow (find & replace those at first)
Reference (starting point) from Mozilla’s documentation
https://developer.mozilla.org/en-US/docs/Web/Guide/Localizations_and_character_encodings
$ python3 -c "
> import html
> print(
> html.unescape('&©—')
> )"
&©—
$ python3 -c "
> import html
> print(
> html.escape('&©—')
> )"
&©—
$ python2 -c "
> from HTMLParser import HTMLParser
> print(
> HTMLParser().unescape('&©—')
> )"
&©—
$ python2 -c "
> import cgi
> print(
> cgi.escape('&©—')
> )"
&©—
HTML only strictly requires &
(ampersand) and <
(left angle bracket / less-than sign) to be escaped. https://html.spec.whatwg.org/multipage/parsing.html#data-state
#!/usr/bin/env python3
import fileinput
import html
for line in fileinput.input():
print(html.unescape(line.rstrip('n')))
For python3
use html.unescape()
:
import html
s = "&"
decoded = html.unescape(s)
# &