Other parsed contents of a static HTML webpage?
Question:
I am creating a Python web scraper, and I have it print the title and span of the web page I enter. I’ve been looking around, but cannot find other elements of a web page.
Are there any other portions of a website which Python can access using bs4 / BeautifulSoup / requests?
I’ve found a head element, but I’m sure there has to be more.
Answers:
Here is a list of HTML tags you can find. In bs4, you generally use the find or findAll methods to scrape a page (findAll is the older spelling; bs4 also offers it as find_all). The first parameter of these functions is the name of the tag you are searching for. Here are some examples of how to use the findAll method: https://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#The%20basic%20find%20method:%20findAll(name,%20attrs,%20recursive,%20text,%20limit,%20**kwargs)
(Stackoverflow would not let me paste the link as a hyperlink)
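As a rough illustration of find and find_all, here is a minimal sketch. The HTML string, its class names, and the variable names are made up for the example (they are not from the question), and it assumes bs4 is installed:

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, inlined so the example needs no network access.
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Heading</h1>
    <p class="intro">First paragraph with a <a href="/one">link</a>.</p>
    <p>Second paragraph with <a href="/two">another link</a>.</p>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find returns the first matching tag (or None if there is no match).
first_paragraph = soup.find("p")

# find_all returns a list of every matching tag.
all_paragraphs = soup.find_all("p")

# Attributes are read like dictionary keys.
hrefs = [a["href"] for a in soup.find_all("a")]

# Extra keyword arguments filter on attributes; class_ matches the CSS class.
intro = soup.find("p", class_="intro")

print(soup.title.string)
print(len(all_paragraphs))
print(hrefs)
```

Any tag name can be passed the same way (for example "div", "img", "table" or "meta"), so the approach is not limited to title, span and head.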
Alternatively you can traverse the document tree like so:
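def walker(node):
    # Text nodes have a name of None and no children to visit.
    if node.name is not None:
        for child in node.children:
            # Process the node: here, print its tag name and Python type.
            print(str(child.name) + ":" + str(type(child)))
            walker(child)

# Assumes soup was built beforehand, e.g. BeautifulSoup(html, 'html.parser').
walker(soup)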
taken from: http://makble.com/parsing-and-traversing-dom-tree-with-beautifulsoup
This goes through each node in the tree from the root, <html>, in a depth-first search. This is done by recursively looking at the children of each node, then the children’s children and so on.
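To see which elements a page actually contains, the same depth-first walk can collect tag names instead of printing them. This is only a sketch: the sample HTML and the helper name iter_tag_names are hypothetical, chosen for illustration:

```python
from collections import Counter

from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample document; replace with the HTML you downloaded.
SAMPLE_HTML = (
    "<html><head><title>T</title></head>"
    "<body><p>a</p><p>b <span>c</span></p></body></html>"
)


def iter_tag_names(node):
    """Yield the name of every tag below node, depth-first."""
    for child in node.children:
        # Skip text nodes; only Tag objects have a tag name and children.
        if isinstance(child, Tag):
            yield child.name
            yield from iter_tag_names(child)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
names = list(iter_tag_names(soup))

# Count how often each tag occurs in the document.
tag_counts = Counter(names)

print(names)
print(tag_counts.most_common())
```

Running this against a real page gives a quick inventory of the tags available to pass to find or findAll.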
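You can also pick out elements with a CSS selector using select, for example:
import requests
from bs4 import BeautifulSoup

scrape_link = 'https://www.imdb.com/chart/top/'
page = requests.get(scrape_link)
soup = BeautifulSoup(page.content, 'html.parser')
movie_data = {}
# Each matching anchor is one movie title in the chart table.
links = soup.select('div.article div.lister table.chart.full-width tbody.lister-list tr td.titleColumn a')
for anchor in links:
    movie_title = anchor.get_text()
    movie_link = "https://www.imdb.com" + anchor['href']
    # Store the result: map each title to its absolute URL.
    movie_data[movie_title] = movie_link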
Also, page.content holds the raw HTML of the response (as bytes), so printing it shows the page source.