jQuery-like HTML parsing in Python?

Question:

Is there any way in Python that would allow me to parse an HTML document similar to what jQuery does?

i.e. I’d like to be able to use CSS selectors syntax to grab an arbitrary set of nodes from the document, read their content/attributes, etc.

Asked By: Roy Tang

Answers:

The lxml library supports CSS selectors.

If you are already comfortable with BeautifulSoup, you could just add soupselect to your libs.
Soupselect is a CSS selector extension for BeautifulSoup.

Usage:

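from urllib.request import urlopen  # Python 3 replacement for Python 2's urllib.urlopen
from bs4 import BeautifulSoup as Soup
from soupselect import select

soup = Soup(urlopen('http://slashdot.org/'), 'html.parser')
# select() returns a list of the elements matching the CSS selector, e.g.:
select(soup, 'div.title h3')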
    [<h3><span><a href='//science.slashdot.org/'>Science</a>:</span></h3>,
     <h3><a href='//slashdot.org/articles/07/02/28/0120220.shtml'>Star Trek</h3>,
    ..]
Answered By: systempuntoout

Consider PyQuery:

http://packages.python.org/pyquery/

>>> from pyquery import PyQuery as pq
>>> from lxml import etree
>>> import urllib.request
>>> d = pq("<html></html>")
>>> d = pq(etree.fromstring("<html></html>"))
>>> d = pq(url='http://google.com/')
>>> d = pq(url='http://google.com/', opener=lambda url, **kw: urllib.request.urlopen(url).read())
>>> d = pq(filename=path_to_html_file)
>>> d("#hello")
[<p#hello.hello>]
>>> p = d("#hello")
>>> p.html()
'Hello world !'
>>> p.html("you know <a href='http://python.org/'>Python</a> rocks")
[<p#hello.hello>]
>>> p.html()
'you know <a href="http://python.org/">Python</a> rocks'
>>> p.text()
'you know Python rocks'
Answered By: Luke Stanley

BeautifulSoup now has built-in support for CSS selectors:

import requests
from bs4 import BeautifulSoup as Soup
html = requests.get('https://stackoverflow.com/questions/3051295').content
soup = Soup(html, 'html.parser')  # name the parser explicitly to avoid a warning

Title of this question:

soup.select('h1.grid--cell :first-child')[0].text

Number of question upvotes:

# select_one returns only the first matching element
soup.select_one('[itemprop="upvoteCount"]').text

This uses the Python Requests library to fetch the HTML page.
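
Because the selectors above depend on the live page's markup, which can change, here is an offline sketch of the same select() and select_one() calls against an inline HTML string. The sample markup and variable names are hypothetical and not taken from any real page; it assumes a BeautifulSoup 4 release with built-in selector support.

```python
from bs4 import BeautifulSoup  # third-party; assumes: pip install beautifulsoup4

# Hypothetical sample markup, invented for this sketch.
SAMPLE = """
<div id="question">
  <h1 class="title"><a href="/q/1">jQuery-like parsing</a></h1>
  <span itemprop="upvoteCount" data-value="42">42</span>
  <ul class="tags"><li>python</li><li>html</li></ul>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# select_one() returns the first match (or None if nothing matches).
title_link = soup.select_one("h1.title > a")
title = title_link.get_text()
href = title_link["href"]  # attributes are read like dictionary keys

# Attribute selectors work as they do in jQuery.
votes = int(soup.select_one('[itemprop="upvoteCount"]').get_text())

# select() returns a list of every matching element.
tags = [li.get_text() for li in soup.select("ul.tags li")]

print(title, href, votes, tags)
```

Since select_one() returns None when nothing matches, real scraping code would normally check the result before reading from it.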

Answered By: imbr