BeautifulSoup: what's the difference between 'lxml' and 'html.parser' and 'html5lib' parsers?

Question:

When using Beautiful Soup, what is the difference between the 'lxml', 'html.parser', and 'html5lib' parsers?

When would you use one over the other, and what are the benefits of each? When I tried them they seemed interchangeable, but people here keep telling me I should be using a different one. I'd like to strengthen my understanding. I've read a couple of posts here about this, but they barely cover when to use each parser, if at all.

Example:

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'lxml')
Asked By: duc hathaway

Answers:

The key differences are highlighted in the BeautifulSoup documentation.

The basic reasons to prefer one parser over the others:

  • html.parser – built-in – no extra dependencies needed
  • html5lib – the most lenient – better to use it if the HTML is broken
  • lxml – the fastest
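
The parsers only look interchangeable on well-formed markup; on broken markup they can build different trees. A minimal sketch of how to see this for yourself, assuming bs4 is installed (lxml and html5lib are optional and are skipped if missing). The helper name parse_with_each is hypothetical, and the invalid fragment "<a></p>" is the one the Beautiful Soup documentation uses to illustrate parser differences:

```python
from bs4 import BeautifulSoup, FeatureNotFound

# The three parser names discussed in this thread.
PARSERS = ("html.parser", "lxml", "html5lib")


def parse_with_each(markup, parsers=PARSERS):
    """Return {parser name: serialized tree} for every parser that is installed."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # lxml and html5lib are separate installs; skip any that are missing.
            continue
    return results


if __name__ == "__main__":
    # "<a></p>" is invalid: a closing </p> with no opening <p>.
    for name, tree in parse_with_each("<a></p>").items():
        print(f"{name:12} {tree}")
```

According to the documentation, html.parser simply drops the stray </p> and returns a bare <a></a>, lxml drops it too but wraps the result in <html> and <body>, and html5lib pairs it with an opening <p> and adds a <head>, the way a browser would.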
Answered By: alecxe

From the docs' summary table of advantages and disadvantages:

  1. html.parser: BeautifulSoup(markup, "html.parser")

    • Advantages: Batteries included, Decent speed, Lenient (as of Python 2.7.3 and 3.2.2)

    • Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2)

  2. lxml: BeautifulSoup(markup, "lxml")

    • Advantages: Very fast, Lenient

    • Disadvantages: External C dependency

  3. html5lib: BeautifulSoup(markup, "html5lib")

    • Advantages: Extremely lenient, Parses pages the same way a web browser does, Creates valid HTML5

    • Disadvantages: Very slow, External Python dependency
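
One way to act on these trade-offs is to prefer the fast parser when it is installed and fall back to the built-in one otherwise. This is only a sketch under that assumption, not something the documentation prescribes; make_soup and its preferred argument are hypothetical names:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse markup with the first installed parser from `preferred`.

    Returns (soup, parser_name) so the caller knows which parser was used.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError(f"none of these parsers are installed: {preferred}")


if __name__ == "__main__":
    soup, used = make_soup("<p>Hello</p>")
    print(used, soup.p.get_text())
```

Because different parsers can produce different trees for the same broken page, a fallback like this trades reproducibility for convenience. If the scraped pages are known to be malformed, passing preferred=("html5lib",) and accepting the slower parse may be the safer choice.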