Python and HTMLParser.handle_data() – How to get data from tags?


I’m trying to parse a web page with the Python HTMLParser. I want to get the content of a tag, but I’m not sure how to do it. This is the code I have so far:

import urllib.request
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        print("Encountered some data:", data)

url = "website"
page = urllib.request.urlopen(url).read()

parser = MyHTMLParser()

If I understand correctly, I can use the handle_data() function to get the data between tags. How do I specify which tags to get the data from? And how do I get the data?

Asked By: user1049697



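html_code = urllib.request.urlopen("xxx")
data = ""
for raw_line in html_code.readlines():
    # urlopen() returns bytes in Python 3; UTF-8 is assumed here
    line = raw_line.decode("utf-8", errors="replace").strip()

    # keep only the lines that start an <h2> element
    if line.startswith("<h2"):
        data = data + line

hp = MyHTMLParser()
hp.feed(data)  # handle_data() is now called for the text inside each <h2>

This way you can extract the data from the h2 tags, as long as each h2 element sits on a single line. Hope it helps.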

Answered By: Yanan
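
For reference, here is a minimal sketch of the order in which HTMLParser calls its handlers, which is what makes it possible to tell which tag a piece of data belongs to. The class name EventLogger and the sample markup are made up for illustration; only the handler methods, feed() and close() come from html.parser.

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Hypothetical helper: records every handler call in order."""
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

logger = EventLogger()
logger.feed("<h2>Title</h2><p>Body</p>")
logger.close()  # flush any text still buffered by the parser

for event in logger.events:
    print(event)
```

handle_data() itself is never told which tag it is in. The data event simply arrives between the matching start and end events, so the parser object has to remember the current tag on its own.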

I don’t have time to format/clean this up, but this is how I usually do it:

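import os
from html.parser import HTMLParser
from urllib.parse import urlparse

class HTMLParse(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # only links are of interest; tag names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                # value is None for attributes written without a value
                if name == "href" and value:
                    path = urlparse(value).path
                    ext = os.path.splitext(path)[1]
                    # report links whose path ends in an image extension
                    if ext.lower() in (".jpeg", ".jpg", ".png"):
                        print("Found: " + value)
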
Answered By: user393899
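
from html.parser import HTMLParser

class HTMLParse(HTMLParser):
    def __init__(self):
        super().__init__()
        self.recordh2 = False  # True while the parser is inside an <h2>
    def handle_starttag(self, tag, attrs):
        if tag == 'h2':
            self.recordh2 = True
    def handle_endtag(self, tag):  # end tags carry no attributes
        if tag == 'h2':
            self.recordh2 = False
    def handle_data(self, data):
        if self.recordh2:
            # do your work here
            print(data)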
Answered By: hwang
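
The flag approach above can be extended into something that returns the text instead of printing it. What follows is only a sketch under a few assumptions: the names TagTextCollector and extract_text and the sample markup are invented for illustration, and a depth counter replaces the boolean flag so that nested elements of the same tag do not stop recording early.

```python
from html.parser import HTMLParser

class TagTextCollector(HTMLParser):
    """Hypothetical helper: collects the text inside every <target_tag>."""
    def __init__(self, target_tag):
        super().__init__()
        self.target_tag = target_tag.lower()  # the parser lowercases tag names
        self.depth = 0      # how many target elements are currently open
        self.chunks = []    # text pieces of the element being read
        self.results = []   # one string per completed element

    def handle_starttag(self, tag, attrs):
        if tag == self.target_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.target_tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.chunks).strip())
                self.chunks = []

    def handle_data(self, data):
        # text inside child tags such as <em> is included as well
        if self.depth > 0:
            self.chunks.append(data)

def extract_text(html, tag):
    collector = TagTextCollector(tag)
    collector.feed(html)
    collector.close()
    return collector.results

sample = "<h1>Skip</h1><h2>First <em>heading</em></h2><p>text</p><h2>Second</h2>"
print(extract_text(sample, "h2"))
```

For a real page, the bytes returned by urllib.request.urlopen(url).read() would need to be decoded to str before being passed to extract_text(), since feed() expects text.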