What is a beautiful soup bound method?
Question:
I’m experimenting with http://robobrowser.readthedocs.org/en/latest/readme.html, a new Python library based on the Beautiful Soup library. I’m trying to test it out by opening an HTML page and returning it within a Django app, but I can’t figure out how to do this most simple task. My Django app contains:
def index(request):
    p = str(request.POST.get('p', False))  # p='https://www.yahoo.com/'
    browser = RoboBrowser(history=True)
    browser.open(p)
    html = browser.find_all
    return HttpResponse(html)
When I look at the HTML that is output, I see:
<bound method BeautifulSoup.find_all of
<!DOCTYPE html>
<html>
......................
<head>
...............
</body>
</html>
>
What is a beautiful soup bound method? How can I get the straight html?
Answers:
It’s a method object, bound to the BeautifulSoup object. You didn’t call it.
Its representation is a little confusing because the repr() of the BeautifulSoup parse tree is included, which is simply the tree rendered as an HTML source string.
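The same effect can be reproduced without any third-party library. The sketch below uses a hypothetical Page class (not part of BeautifulSoup or RoboBrowser) whose repr() is its source text, to show why an uncalled method prints as "bound method ... of" followed by the whole document:

```python
class Page:
    """Hypothetical stand-in for a parsed document; not a real library class."""

    def __init__(self, source):
        self.source = source

    def __repr__(self):
        # Like a BeautifulSoup tree, the repr is just the source text.
        return self.source

    def find_all(self):
        # Placeholder search that returns the whole source in a list.
        return [self.source]


page = Page("<html></html>")

uncalled = page.find_all   # no parentheses: a bound method object
called = page.find_all()   # parentheses: the method actually runs

# The text starts with "<bound method" and ends with the instance's repr.
shown = repr(uncalled)
```

The only difference between the two assignments is the pair of parentheses, which is exactly what is missing in the view in the question.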
To get to the underlying BeautifulSoup parse tree, you can use browser.state.parsed; use str() to turn that back into a source string:
html = str(browser.state.parsed)
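As a rough illustration, the sketch below uses bs4 directly instead of RoboBrowser so that it runs without a network request. The html_source string is made up for the example, and the soup object stands in for the parse tree that the browser state exposes:

```python
from bs4 import BeautifulSoup

# Made-up document used only for this illustration.
html_source = "<html><body><p>one</p><p>two</p></body></html>"

# "html.parser" is the parser from the standard library.
soup = BeautifulSoup(html_source, "html.parser")

uncalled = soup.find_all           # bound method, nothing has been searched
paragraphs = soup.find_all("p")    # calling it returns the matching tags
texts = [p.get_text() for p in paragraphs]

rendered = str(soup)               # parse tree turned back into a source string
```

Here rendered is an ordinary string that can be handed to HttpResponse, whereas uncalled is the method object that caused the confusing output.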
Alternatively, you can still access the original requests response object with:
browser.state.response
which means that the original downloaded HTML is found as:
html = browser.state.response.content
BeautifulSoup is a Python package used for parsing HTML and XML documents. It creates a parse tree for the parsed pages, which can be used for web scraping.
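BeautifulSoup has many methods, such as find_all(), that let us search a parse tree. A method is "bound" when it is looked up on a particular object, so a bound find_all always searches that object's tree; the term has nothing to do with going out of bounds.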
.next_sibling and .previous_sibling are attributes used for navigating between page elements that are on the same level of the parse tree.
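A minimal sketch of sibling navigation, assuming a made-up list with no whitespace between the tags (with whitespace, the neighbouring sibling would be a text node rather than a tag):

```python
from bs4 import BeautifulSoup

# Made-up markup; the tags are written without whitespace between them
# so that each <li> is directly adjacent to the next one.
soup = BeautifulSoup("<ul><li>a</li><li>b</li><li>c</li></ul>", "html.parser")

items = soup.find_all("li")
middle = items[1]

prev_text = middle.previous_sibling.get_text()  # text of the <li> before it
next_text = middle.next_sibling.get_text()      # text of the <li> after it

# The first item has nothing before it on the same level.
first_has_previous = items[0].previous_sibling is not None
```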