Equivalent to InnerHTML when using lxml.html to parse HTML
Question:
I’m working on a script using lxml.html to parse web pages. I have done a fair bit of BeautifulSoup in my time but am now experimenting with lxml due to its speed.
I would like to know what the most sensible way in the library is to do the equivalent of Javascript’s InnerHtml – that is, to retrieve or set the complete contents of a tag.
<body>
<h1>A title</h1>
<p>Some text</p>
</body>
InnerHtml is therefore:
<h1>A title</h1>
<p>Some text</p>
I can do it using hacks (converting to string/regexes etc) but I’m assuming that there is a correct way to do this using the library which I am missing due to unfamiliarity. Thanks for any help.
EDIT: Thanks to pobk for showing me the way on this so quickly and effectively. For anyone trying the same, here is what I ended up with:
from lxml import html
from cStringIO import StringIO
t = html.parse(StringIO(
"""<body>
<h1>A title</h1>
<p>Some text</p>
Untagged text
<p>
Unclosed p tag
</body>"""))
root = t.getroot()
body = root.body
print (element.text or '') + ''.join([html.tostring(child) for child in body.iterdescendants()])
Note that the lxml.html parser will fix up the unclosed tag, so beware if this is a problem.
Answers:
You can get the children of an ElementTree node using the getchildren() or iterdescendants() methods of the root node:
>>> from lxml import etree
>>> from cStringIO import StringIO
>>> t = etree.parse(StringIO("""<body>
... <h1>A title</h1>
... <p>Some text</p>
... </body>"""))
>>> root = t.getroot()
>>> for child in root.iterdescendants(),:
... print etree.tostring(child)
...
<h1>A title</h1>
<p>Some text</p>
This can be shorthanded as follows:
print ''.join([etree.tostring(child) for child in root.iterdescendants()])
Sorry for bringing this up again, but I’ve been looking for a solution and yours contains a bug:
<body>This text is ignored
<h1>Title</h1><p>Some text</p></body>
Text directly under the root element is ignored. I ended up doing this:
(body.text or '') +
''.join([html.tostring(child) for child in body.iterchildren()])
import lxml.etree as ET
body = t.xpath("//body");
for tag in body:
h = html.fromstring( ET.tostring(tag[0]) ).xpath("//h1");
p = html.fromstring( ET.tostring(tag[1]) ).xpath("//p");
htext = h[0].text_content();
ptext = h[0].text_content();
you can also use .get('href')
for a tag and .attrib
for attribute ,
here tag no is hardcoded but you can also do this dynamic
Here is a Python 3 version:
from xml.sax import saxutils
from lxml import html
def inner_html(tree):
""" Return inner HTML of lxml element """
return (saxutils.escape(tree.text) if tree.text else '') +
''.join([html.tostring(child, encoding=str) for child in tree.iterchildren()])
Note that this includes escaping of the initial text as recommended by andreymal — this is needed to avoid tag injection if you’re working with sanitized HTML!
I find none of the answers satisfying, some are even in Python 2. So I add a one-liner solution that produces innerHTML-like output and works with Python 3:
from lxml import etree, html
# generate some HTML element node
node = html.fromstring("""<container>
Some random text <b>bold <i>italic</i> yeah</b> no yeah
<!-- comment blah blah --> <img src='gaga.png' />
</container>""")
# compute inner HTML of element
innerHTML = "".join([
str(c) if type(c)==etree._ElementUnicodeResult
else html.tostring(c, with_tail=False).decode()
for c in node.xpath("node()")
]).strip()
The result will be:
'Some random text <b>bold <i>italic</i> yeah</b> no yeahn<!-- comment blah blah --> <img src="gaga.png">'
What it does: The xpath delivers all node children (text, elements, comments). The list comprehension produces a list of the text contents of the text nodes and HTML content of element nodes. Those are then joined into a single string. If you want to get rid of comments, use *|text()
instead of node()
for xpath.
I’m working on a script using lxml.html to parse web pages. I have done a fair bit of BeautifulSoup in my time but am now experimenting with lxml due to its speed.
I would like to know what the most sensible way in the library is to do the equivalent of Javascript’s InnerHtml – that is, to retrieve or set the complete contents of a tag.
<body>
<h1>A title</h1>
<p>Some text</p>
</body>
InnerHtml is therefore:
<h1>A title</h1>
<p>Some text</p>
I can do it using hacks (converting to string/regexes etc) but I’m assuming that there is a correct way to do this using the library which I am missing due to unfamiliarity. Thanks for any help.
EDIT: Thanks to pobk for showing me the way on this so quickly and effectively. For anyone trying the same, here is what I ended up with:
from lxml import html
from cStringIO import StringIO
t = html.parse(StringIO(
"""<body>
<h1>A title</h1>
<p>Some text</p>
Untagged text
<p>
Unclosed p tag
</body>"""))
root = t.getroot()
body = root.body
print (element.text or '') + ''.join([html.tostring(child) for child in body.iterdescendants()])
Note that the lxml.html parser will fix up the unclosed tag, so beware if this is a problem.
You can get the children of an ElementTree node using the getchildren() or iterdescendants() methods of the root node:
>>> from lxml import etree
>>> from cStringIO import StringIO
>>> t = etree.parse(StringIO("""<body>
... <h1>A title</h1>
... <p>Some text</p>
... </body>"""))
>>> root = t.getroot()
>>> for child in root.iterdescendants(),:
... print etree.tostring(child)
...
<h1>A title</h1>
<p>Some text</p>
This can be shorthanded as follows:
print ''.join([etree.tostring(child) for child in root.iterdescendants()])
Sorry for bringing this up again, but I’ve been looking for a solution and yours contains a bug:
<body>This text is ignored
<h1>Title</h1><p>Some text</p></body>
Text directly under the root element is ignored. I ended up doing this:
(body.text or '') +
''.join([html.tostring(child) for child in body.iterchildren()])
import lxml.etree as ET
body = t.xpath("//body");
for tag in body:
h = html.fromstring( ET.tostring(tag[0]) ).xpath("//h1");
p = html.fromstring( ET.tostring(tag[1]) ).xpath("//p");
htext = h[0].text_content();
ptext = h[0].text_content();
you can also use .get('href')
for a tag and .attrib
for attribute ,
here tag no is hardcoded but you can also do this dynamic
Here is a Python 3 version:
from xml.sax import saxutils
from lxml import html
def inner_html(tree):
""" Return inner HTML of lxml element """
return (saxutils.escape(tree.text) if tree.text else '') +
''.join([html.tostring(child, encoding=str) for child in tree.iterchildren()])
Note that this includes escaping of the initial text as recommended by andreymal — this is needed to avoid tag injection if you’re working with sanitized HTML!
I find none of the answers satisfying, some are even in Python 2. So I add a one-liner solution that produces innerHTML-like output and works with Python 3:
from lxml import etree, html
# generate some HTML element node
node = html.fromstring("""<container>
Some random text <b>bold <i>italic</i> yeah</b> no yeah
<!-- comment blah blah --> <img src='gaga.png' />
</container>""")
# compute inner HTML of element
innerHTML = "".join([
str(c) if type(c)==etree._ElementUnicodeResult
else html.tostring(c, with_tail=False).decode()
for c in node.xpath("node()")
]).strip()
The result will be:
'Some random text <b>bold <i>italic</i> yeah</b> no yeahn<!-- comment blah blah --> <img src="gaga.png">'
What it does: The xpath delivers all node children (text, elements, comments). The list comprehension produces a list of the text contents of the text nodes and HTML content of element nodes. Those are then joined into a single string. If you want to get rid of comments, use *|text()
instead of node()
for xpath.