Regex to extract URLs from href attribute in HTML with Python

Question:

Possible Duplicate:
What is the best regular expression to check if a string is a valid URL?

Considering a string as follows:

string = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://2.example">Even More Examples</a>'

How could I, with Python, extract the URLs inside the anchor tags' href attributes? Something like:

>>> url = getURLs(string)
>>> url
['http://example.com', 'http://2.example']
Asked By: user825286

Answers:

>>> import re
>>> url = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://2.example">Even More Examples</a>'
>>> urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', url)
>>> print(urls)
['http://example.com', 'http://2.example']
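
If you do stick with a regex, one way to avoid picking up URLs that appear outside of anchor tags is to anchor the pattern to the href attribute itself. A minimal sketch under that assumption; it only handles double-quoted attribute values:

>>> re.findall(r'href="(https?://[^"]+)"', url)
['http://example.com', 'http://2.example']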
Answered By: JohnJohnGa

The best answer is…

Don’t use a regex

The expression in the accepted answer misses many cases. Among other things, URLs can contain Unicode characters. A regex that truly handles URLs does exist, and after looking at it, you may conclude that you don't really want it after all: the most correct version is roughly ten thousand characters long.
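
To see one concrete gap: the character class in that pattern does not even include the / character, so anything past the hostname is silently dropped (a made-up path, purely to illustrate):

>>> re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', '<a href="http://example.com/some/page">link</a>')
['http://example.com']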

Admittedly, if you were starting with plain, unstructured text that happened to contain a bunch of URLs, you might need that ten-thousand-character monster. But if your input is structured, use the structure. Your stated aim is to "extract the URLs inside the anchor tags' href attributes." Why use a ten-thousand-character regex when you can do something much simpler?

Parse the HTML instead

For many tasks, Beautiful Soup will be far faster and easier to use:

>>> from bs4 import BeautifulSoup as Soup
>>> s = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://2.example">Even More Examples</a>'
>>> html = Soup(s, 'html.parser')           # Soup(s, 'lxml') if lxml is installed
>>> [a['href'] for a in html.find_all('a')]
['http://example.com', 'http://2.example']
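
If some anchors might not have an href at all, Beautiful Soup can be told to return only the tags that do:

>>> [a['href'] for a in html.find_all('a', href=True)]
['http://example.com', 'http://2.example']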

If you prefer not to use external tools, you can use Python's built-in HTML parsing library directly. Here's a really simple subclass of HTMLParser that does exactly what you want:

from html.parser import HTMLParser

class MyParser(HTMLParser):
    def __init__(self, output_list=None):
        HTMLParser.__init__(self)
        # Collect hrefs into a caller-supplied list, or into a fresh one.
        if output_list is None:
            self.output_list = []
        else:
            self.output_list = output_list

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; anchors without an
        # href will contribute None here.
        if tag == 'a':
            self.output_list.append(dict(attrs).get('href'))

Test:

>>> p = MyParser()
>>> p.feed(s)
>>> p.output_list
['http://example.com', 'http://2.example']

You could even add a method that accepts a string, calls feed, and returns output_list. This approach is vastly more powerful and extensible than regular expressions for extracting information from HTML.
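
For instance, a minimal sketch of a helper in that spirit, written here as a plain function rather than a method (the name getURLs mirrors the question and is only illustrative):

def getURLs(html_string):
    # Sketch: build a parser, feed it the string, and return the hrefs it collected.
    parser = MyParser()
    parser.feed(html_string)
    return parser.output_list

>>> getURLs(s)
['http://example.com', 'http://2.example']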

Answered By: senderle