Other parsed contents of a static HTML webpage?

Question

I am creating a Python web scraper, and I have it print the title and span of the web page I enter. I’ve been looking around, but cannot find which other elements of a web page I can access.

Are there any other portions of a website which Python can access using bs4 / BeautifulSoup / requests?

I’ve found a head element, but I’m sure there has to be more.

Asked By: user11389575

Answer 1

Here is a list of HTML tags you can find. In bs4, you generally use the find or find_all methods to scrape a page (findAll is the older spelling and still works as an alias). The first parameter of these functions is the name of the tag you are searching for. Here are some examples of how to use the findAll method: https://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#The%20basic%20find%20method:%20findAll(name,%20attrs,%20recursive,%20text,%20limit,%20**kwargs) (Stack Overflow would not let me paste the link as a hyperlink)
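As a minimal sketch of the find / find_all calls described above: the HTML string below is a made-up sample (not from the question), so the tag names, the "intro" class and the link targets are assumptions for illustration only. Passing True to find_all matches every tag, which is one quick way to see which elements a page actually contains.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; any static HTML string is handled the same way.
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1 id="top">Heading</h1>
    <p class="intro">First paragraph with a <a href="/one">link</a>.</p>
    <p>Second paragraph with <a href="/two">another link</a>.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first matching tag (or None); find_all returns every match.
first_paragraph = soup.find("p")
all_paragraphs = soup.find_all("p")

# Attributes are read with dictionary-style access on a tag.
link_targets = [a["href"] for a in soup.find_all("a")]

# Keyword arguments filter on attributes; class_ avoids the Python keyword.
intro = soup.find("p", class_="intro")

# True matches every tag, so this collects the distinct tag names on the page.
tag_names = sorted({tag.name for tag in soup.find_all(True)})

print(first_paragraph.get_text())
print(len(all_paragraphs))
print(link_targets)
print(tag_names)
```

On a real page, the html string would be replaced by the body of the response you fetched with requests.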

Alternatively you can traverse the document tree like so:

def walker(soup):
    # Text nodes have no tag name, so only tags are descended into
    if soup.name is not None:
        for child in soup.children:
            # process node
            print(str(child.name) + ":" + str(type(child)))
            walker(child)

walker(soup)

taken from: http://makble.com/parsing-and-traversing-dom-tree-with-beautifulsoup

This visits each node in the tree in depth-first order, starting from the root, <html>. It does so by recursively looking at the children of each node, then the children’s children, and so on.
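The depth-first walk described above can also be sketched so that it records how deep each tag sits. This is only an illustration: the HTML string and the tag_outline helper are hypothetical names, not part of the original answer. BeautifulSoup's own .descendants generator yields nodes in the same depth-first order, so it is shown alongside for comparison.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample document, written without whitespace between tags
# so that the only text node is "Hi".
html = "<html><head><title>T</title></head><body><div><p>Hi</p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")


def tag_outline(node, depth=0):
    """Return (depth, tag name) pairs in depth-first order, skipping text nodes."""
    outline = []
    for child in node.children:
        if isinstance(child, Tag):
            outline.append((depth, child.name))
            outline.extend(tag_outline(child, depth + 1))
    return outline


outline = tag_outline(soup)
for depth, name in outline:
    # Indent each tag name by its depth to show the tree shape.
    print("  " * depth + name)

# .descendants yields tags and text nodes depth-first; keep only the tags.
descendant_names = [node.name for node in soup.descendants if isinstance(node, Tag)]
print(descendant_names)
```

The recursive version is useful when you need the depth or want to stop early in a subtree; .descendants is shorter when a flat list of nodes is enough.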

Answered By: Calder White

Answer 2

import requests
from bs4 import BeautifulSoup

scrape_link = 'https://www.imdb.com/chart/top/'
page = requests.get(scrape_link)

soup = BeautifulSoup(page.content, 'html.parser')

# Map each movie title to its absolute URL
movie_data = {}
links = soup.select('div.article div.lister table.chart.full-width tbody.lister-list tr td.titleColumn a')
for anchor in links:
    movie_title = anchor.get_text()
    movie_link = "https://www.imdb.com" + anchor['href']
    movie_data[movie_title] = movie_link

Also, page.content holds the raw HTML of the response as bytes, so printing it shows the page’s HTML.
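To try the select approach without a network request, here is a small sketch on an inline HTML string. The markup is invented for illustration and only loosely resembles a chart table; it is not IMDb's real HTML, and base_url is a placeholder.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on a chart table.
html = """
<table class="chart">
  <tbody>
    <tr><td class="titleColumn"><a href="/title/a/">Movie A</a></td></tr>
    <tr><td class="titleColumn"><a href="/title/b/">Movie B</a></td></tr>
    <tr><td class="ratingColumn">9.0</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
base_url = "https://example.com"  # placeholder, not a real site layout

# select takes a CSS selector and returns a list of every matching tag.
movie_data = {}
for anchor in soup.select("table.chart td.titleColumn a"):
    movie_data[anchor.get_text()] = base_url + anchor["href"]

# select_one returns only the first match (or None if nothing matches).
first_link = soup.select_one("td.titleColumn a")

print(movie_data)
print(first_link["href"])
```

A shorter selector such as the one here tends to be less brittle than a long chain of nested classes, since it keeps working when wrapper elements around the table change.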

Answered By: Dennis Ungureanu
