Don't put html, head and body tags automatically, beautifulsoup
Question:
I’m using BeautifulSoup with html5lib, and it adds the html, head and body tags automatically:
BeautifulSoup('<h1>FOO</h1>', 'html5lib') # => <html><head></head><body><h1>FOO</h1></body></html>
Is there any option I can set to turn off this behavior?
Answers:
You may have misunderstood BeautifulSoup here. BeautifulSoup deals with whole HTML documents, not with HTML fragments. What you see is by design.
Without <html> and <body> tags, your HTML document is broken. BeautifulSoup leaves it to the specific parser to repair such a document, and parsers differ in how much they can repair. html5lib is the most thorough of the parsers, but you’ll get similar results with the lxml parser (although lxml leaves out the <head> tag). The html.parser parser is the least capable: it can do some repair work, but it doesn’t add back required-but-missing tags.
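A quick sketch of that difference, feeding the same fragment to all three parsers (assumes the optional lxml and html5lib packages are installed; the builtin html.parser needs nothing extra):

```python
# Compare how each parser repairs (or doesn't repair) the same fragment.
from bs4 import BeautifulSoup

fragment = '<h1>FOO</h1>'
for parser in ('html5lib', 'lxml', 'html.parser'):
    print(f'{parser}: {BeautifulSoup(fragment, parser)}')
# html5lib:    <html><head></head><body><h1>FOO</h1></body></html>
# lxml:        <html><body><h1>FOO</h1></body></html>
# html.parser: <h1>FOO</h1>
```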
So this is a deliberate feature of the html5lib library: it fixes HTML that is lacking, such as adding back missing required elements.
There is no option for BeautifulSoup to treat the HTML you pass in as a fragment. At most you can ‘break’ the document and remove the <html> and <body> elements again with the standard BeautifulSoup tree manipulation methods.
E.g. using Element.replace_with() lets you replace the html element with your <h1> element:
>>> soup = BeautifulSoup('<h1>FOO</h1>', 'html5lib')
>>> soup
<html><head></head><body><h1>FOO</h1></body></html>
>>> soup.html.replace_with(soup.body.contents[0])
<html><head></head><body></body></html>
>>> soup
<h1>FOO</h1>
Take into account, however, that html5lib can add other elements to your tree too, such as tbody elements:
>>> BeautifulSoup(
... '<table><tr><td>Foo</td><td>Bar</td></tr></table>', 'html5lib'
... ).table
<table><tbody><tr><td>Foo</td><td>Bar</td></tr></tbody></table>
The HTML standard states that a table should always have a <tbody> element, and if it is missing, a parser should treat the document as if the element is there anyway. html5lib follows the standard very closely.
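If the parser-added tbody is unwanted, one option (a sketch, not the only way) is to unwrap it again with the standard tree manipulation methods:

```python
# Remove a <tbody> wrapper by replacing it with its own children.
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<table><tbody><tr><td>Foo</td></tr></tbody></table>', 'html.parser')
if soup.table.tbody is not None:
    soup.table.tbody.unwrap()
print(soup)  # <table><tr><td>Foo</td></tr></table>
```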
In [35]: import bs4 as bs
In [36]: bs.BeautifulSoup('<h1>FOO</h1>', "html.parser")
Out[36]: <h1>FOO</h1>
This parses the HTML with Python’s builtin HTML parser.
Quoting the docs:
Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn’t even bother to add an <html> tag.
Alternatively, you could use the html5lib parser and just select the element after <body>:
In [61]: soup = bs.BeautifulSoup('<h1>FOO</h1>', 'html5lib')
In [62]: soup.body.next
Out[62]: <h1>FOO</h1>
Yet another solution:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello <a href="http://google.com">Google</a></p><p>Hi!</p>', 'lxml')

# content handling example (just for example):
# replace Google with StackOverflow
for a in soup.find_all('a'):
    a['href'] = 'http://stackoverflow.com/'
    a.string = 'StackOverflow'

print(''.join(str(i) for i in soup.html.body.find_all(recursive=False)))
Let’s first create a soup sample:
soup = BeautifulSoup("<head></head><body><p>content</p></body>", "html5lib")
You can get html and body’s children by specifying soup.body.<tag>:
# python3: get body's first child
print(next(soup.body.children))
# if first child's tag is rss
print(soup.body.rss)
Also, you can use unwrap() to remove body, head, and html:
soup.html.body.unwrap()
if soup.html.select('> head'):
    soup.html.head.unwrap()
soup.html.unwrap()
If you load an XML file, bs4.diagnose(data) will tell you to use lxml-xml, which will not wrap your soup in html+body:
>>> BS('<foo>xxx</foo>', 'lxml-xml')
<foo>xxx</foo>
If you want it to look better, try this:
BeautifulSoup([contents you want to analyze].prettify())
This aspect of BeautifulSoup has always annoyed the hell out of me.
Here’s how I deal with it:
# Parse the initial html-formatted string
soup = BeautifulSoup(html, 'lxml')
# Do stuff here
# Extract a string repr of the parsed html object, without the <html> or <body> tags
html = "".join([str(x) for x in soup.body.children])
A quick breakdown:
# Iterator object of all tags within the <body> tag (your html before parsing)
soup.body.children
# Turn each element into a string object, rather than a bs4.Tag object
# Note: inclusive of html tags
str(x)
# Get a List of all html nodes as string objects
[str(x) for x in soup.body.children]
# Join all the string objects together to recreate your original html
"".join()
I still don’t like this, but it gets the job done. I always run into this when I use BS4 to filter certain elements and/or attributes from HTML documents before doing something else with them where I need the entire object back as a string repr rather than a BS4 parsed object.
Hopefully, the next time I Google this, I’ll find my answer here.
Since v4.0.1 there’s a method decode_contents()
:
>>> BeautifulSoup('<h1>FOO</h1>', 'html5lib').body.decode_contents()
'<h1>FOO</h1>'
More details in a solution to this question:
https://stackoverflow.com/a/18602241/237105
Update:
As rightfully noted by @MartijnPieters in the comments, this way you’ll still get some extra tags like tbody (in tables), which you might or might not want.
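A minimal sketch of that caveat: decode_contents() drops the outer wrapper but keeps whatever the parser inserted inside it (assumes the html5lib package is installed):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<table><tr><td>Foo</td></tr></table>', 'html5lib')
# The <html>/<body> wrapper is gone, but the parser-added <tbody> remains.
print(soup.body.decode_contents())
# <table><tbody><tr><td>Foo</td></tr></tbody></table>
```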
Here is how I do it:
a = BeautifulSoup('', 'html.parser')
a.append(a.new_tag('section'))
# this will give you <section></section>
html = str(soup)
html = html.replace("<html><body>", "")
html = html.replace("</body></html>", "")
will remove the html/body tag brackets. A more sophisticated version would also check startswith and endswith …
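A sketch of that more sophisticated version, using str.removeprefix()/str.removesuffix() (Python 3.9+) so the wrapper is only stripped when it actually sits at the ends of the string:

```python
html = '<html><body><h1>FOO</h1></body></html>'
# removeprefix/removesuffix are no-ops when the string doesn't start/end with them
html = html.removeprefix('<html><body>').removesuffix('</body></html>')
print(html)  # <h1>FOO</h1>
```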