Don't put html, head and body tags automatically, beautifulsoup

Question:

I’m using BeautifulSoup with html5lib, and it adds the html, head and body tags automatically:

BeautifulSoup('<h1>FOO</h1>', 'html5lib') # => <html><head></head><body><h1>FOO</h1></body></html>

Is there an option I can set to turn off this behavior?

Asked By: Bengineer

Answers:

You may have misunderstood BeautifulSoup here. BeautifulSoup deals with whole HTML documents, not with HTML fragments. What you see is by design.

Without <html> and <body> tags, your HTML document is broken. BeautifulSoup leaves it to the specific parser to repair such a document, and parsers differ in how much they repair. html5lib is the most thorough of the parsers, but you’ll get similar results with the lxml parser (although lxml leaves out the <head> tag). The html.parser parser is the least capable: it can do some repair work, but it doesn’t add back required but missing tags.

So this is a deliberate feature of the html5lib library: it fixes HTML that is lacking, for example by adding back missing required elements.
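
To see the difference for yourself, a minimal sketch along these lines parses the same fragment with each parser and prints what comes back. The helper name parse_with is made up for this illustration, and lxml and html5lib are assumed to be optional installs, so a missing parser is reported rather than treated as an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def parse_with(markup, parser):
    """Return the serialized tree, or None if the parser is not installed."""
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        return None

# Parse the same fragment with every parser mentioned above.
results = {p: parse_with('<h1>FOO</h1>', p) for p in ('html.parser', 'lxml', 'html5lib')}

for name, out in results.items():
    print(name, '->', out if out is not None else '(not installed)')
```

Only html.parser ships with Python itself, so it is the one entry that should always produce a result.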

There is no option for BeautifulSoup to treat the HTML you pass in as a fragment. At most you can ‘break’ the document and remove the <html> and <body> elements again with the standard BeautifulSoup tree manipulation methods.

E.g. using PageElement.replace_with() lets you replace the html element with your <h1> element:

>>> soup = BeautifulSoup('<h1>FOO</h1>', 'html5lib')
>>> soup
<html><head></head><body><h1>FOO</h1></body></html>
>>> soup.html.replace_with(soup.body.contents[0])
<html><head></head><body></body></html>
>>> soup
<h1>FOO</h1>

Take into account, however, that html5lib can add other elements to your tree too, such as tbody elements:

>>> BeautifulSoup(
...     '<table><tr><td>Foo</td><td>Bar</td></tr></table>', 'html5lib'
... ).table
<table><tbody><tr><td>Foo</td><td>Bar</td></tr></tbody></table>

The HTML standard states that a table should always have a <tbody> element, and if it is missing, a parser should treat the document as if the element is there anyway. html5lib follows the standard very, very closely.
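
If the extra tbody matters to you, a quick check like the following shows that html.parser hands the table back as it was written. This is only a sketch: html.parser is swapped in for html5lib here, and it does no standards-based repair at all, so it is not a drop-in replacement.

```python
from bs4 import BeautifulSoup

markup = '<table><tr><td>Foo</td><td>Bar</td></tr></table>'

# html.parser builds the tree from the tags that are actually present,
# so no <tbody> (and no <html>/<body>) is inserted.
soup = BeautifulSoup(markup, 'html.parser')
print(soup.table)

has_tbody = soup.find('tbody') is not None
print(has_tbody)
```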

Answered By: Martijn Pieters

In [35]: import bs4 as bs

In [36]: bs.BeautifulSoup('<h1>FOO</h1>', "html.parser")
Out[36]: <h1>FOO</h1>

This parses the HTML with Python’s builtin HTML parser.
Quoting the docs:

Unlike html5lib, this parser makes no attempt to create a well-formed
HTML document by adding a <body> tag. Unlike lxml, it doesn’t even
bother to add an <html> tag.


Alternatively, you could use the html5lib parser and just select the first element inside <body>:

In [61]: soup = bs.BeautifulSoup('<h1>FOO</h1>', 'html5lib')

In [62]: soup.body.next
Out[62]: <h1>FOO</h1>
Answered By: unutbu

Yet another solution:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello <a href="http://google.com">Google</a></p><p>Hi!</p>', 'lxml')
# content handling example (just for example)
# replace Google with StackOverflow
for a in soup.find_all('a'):
    a['href'] = 'http://stackoverflow.com/'
    a.string = 'StackOverflow'
# serialize only the direct child tags of <body>, dropping the html/body wrappers
print(''.join(str(i) for i in soup.body.find_all(recursive=False)))
Answered By: userlond

Let’s first create a soup sample:

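# parser named explicitly (lxml is an assumption); html.parser would not add the <html> wrapper used below
soup = BeautifulSoup("<head></head><body><p>content</p></body>", "lxml")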

You can get a child of html or body by name with soup.body.<tag>:

# python3: get body's first child
print(next(soup.body.children))

# if first child's tag is rss
print(soup.body.rss)

You can also use unwrap() to remove body, head, and html:

soup.html.body.unwrap()
if soup.html.head is not None:
    soup.html.head.unwrap()
soup.html.unwrap()
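
The same idea can be wrapped in a small helper that tolerates missing tags. This is a sketch only: strip_wrappers is a hypothetical name, and html.parser is used on a document that already contains the wrappers so the example runs without extra parsers installed.

```python
from bs4 import BeautifulSoup

def strip_wrappers(soup):
    """Unwrap <body>, <head> and <html> if present, leaving their children in place."""
    for name in ('body', 'head', 'html'):
        tag = soup.find(name)
        if tag is not None:
            tag.unwrap()
    return soup

doc = BeautifulSoup('<html><head></head><body><p>content</p></body></html>', 'html.parser')
print(strip_wrappers(doc))
```

Keep in mind that unwrap() keeps the children, so anything inside head (a title, for instance) would end up at the top level of the result.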

If you load an XML file, bs4.diagnose.diagnose(data) will show you how each installed parser handles it, including lxml-xml, which will not wrap your soup in html and body tags:

>>> BS('<foo>xxx</foo>', 'lxml-xml')
<foo>xxx</foo>
Answered By: ahuigo

If you want it to look better, try this:

BeautifulSoup([contents you want to analyze]).prettify()

Answered By: Jaylin

This aspect of BeautifulSoup has always annoyed the hell out of me.

Here’s how I deal with it:

# Parse the initial html-formatted string
soup = BeautifulSoup(html, 'lxml')

# Do stuff here

# Extract a string repr of the parsed HTML object, without the <html> or <body> tags
html = "".join([str(x) for x in soup.body.children])

A quick breakdown:

# Iterator over all nodes directly inside the <body> tag (your html before parsing)
soup.body.children

# Turn each element into a string object, rather than a BS4.Tag object
# Note: inclusive of html tags
str(x)

# Get a List of all html nodes as string objects
[str(x) for x in soup.body.children]

# Join all the string objects together to recreate your original html
"".join()

I still don’t like this, but it gets the job done. I always run into this when I use BS4 to filter certain elements and/or attributes from HTML documents and then need the whole result back as a string rather than as a BS4 parsed object.

Hopefully, the next time I Google this, I’ll find my answer here.

Answered By: alphazwest

Since v4.0.1 there’s a method decode_contents():

>>> BeautifulSoup('<h1>FOO</h1>', 'html5lib').body.decode_contents()
'<h1>FOO</h1>' 

More details in a solution to this question:
https://stackoverflow.com/a/18602241/237105

Update:

As rightfully noted by @MartijnPieters in the comments, this way you’ll still get some extra tags like tbody (in tables), which you might or might not want.
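
For reuse, decode_contents() can be wrapped in a helper that also copes with parsers that do not add a body. This is a rough sketch: inner_html is a made-up name, and the default of html.parser is just an assumption so the example runs without extra packages.

```python
from bs4 import BeautifulSoup

def inner_html(markup, parser='html.parser'):
    """Parse markup and return it re-serialized without html/head/body wrappers."""
    soup = BeautifulSoup(markup, parser)
    # html5lib and lxml add a <body>; html.parser does not,
    # so fall back to the whole tree when there is no <body>.
    root = soup.body if soup.body is not None else soup
    return root.decode_contents()

print(inner_html('<h1>FOO</h1>'))
```

Passing 'html5lib' or 'lxml' as the second argument should give the repaired markup minus the wrappers, assuming those parsers are installed.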

Answered By: Antony Hatchkins

Here is how I do it

a = BeautifulSoup('', 'html.parser')
a.append(a.new_tag('section'))
# this will give you <section></section>
Answered By: Mahmoud Hanora

html = str(soup)
html = html.replace("<html><body>", "")
html = html.replace("</body></html>", "")

This removes the wrapping html/body tags. A more sophisticated version would also check with startswith() and endswith() …
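
One way such a check could look, as a sketch only (strip_html_body is a hypothetical name, and the exact wrapper strings are assumed to be what the lxml parser produces):

```python
def strip_html_body(html):
    """Remove one leading <html><body> and trailing </body></html>, if both are present."""
    prefix, suffix = "<html><body>", "</body></html>"
    if html.startswith(prefix) and html.endswith(suffix):
        # Slice instead of replace() so matching text inside the document is left alone.
        return html[len(prefix):len(html) - len(suffix)]
    return html

print(strip_html_body("<html><body><h1>FOO</h1></body></html>"))
```

Output from html5lib starts with <html><head></head><body>, so it would not match this prefix and would be returned unchanged.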

Answered By: Wolfgang Fahl