BeautifulSoup: what's the difference between 'lxml' and 'html.parser' and 'html5lib' parsers?


When using Beautiful Soup, what is the difference between the 'lxml', 'html.parser', and 'html5lib' parsers?

When would you use one over the others, and what are the benefits of each? When I tried each of them they seemed interchangeable, but people here keep telling me I should be using a different one. I'd like to strengthen my understanding; I've read a couple of posts here about this, but they barely cover the use cases, if at all.


soup = BeautifulSoup(response.text, 'lxml')
Asked By: duc hathaway



The key differences are highlighted in the Beautiful Soup documentation.

In short, the main reason to prefer each parser over the others:

  • html.parser: built-in, no extra dependencies needed
  • html5lib: the most lenient; prefer it if the HTML is broken
  • lxml: the fastest
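
One way to act on these trade-offs is to use the fastest parser that happens to be installed and fall back to the built-in one otherwise. The following is a minimal sketch, not part of Beautiful Soup itself: `pick_parser` and its preference order are hypothetical, and it only checks whether each parser's module can be found.

```python
from importlib.util import find_spec

# Hypothetical preference order: fastest first, built-in last.
# Each Beautiful Soup parser name here matches the module that backs it.
PREFERENCE = ("lxml", "html5lib", "html.parser")


def pick_parser(preference=PREFERENCE):
    """Return the first parser name whose backing module is importable."""
    for name in preference:
        # find_spec returns None when a top-level module is not installed.
        if find_spec(name) is not None:
            return name
    # html.parser ships with Python, so it is always a safe fallback.
    return "html.parser"


if __name__ == "__main__":
    print(pick_parser())
```

The result can then be passed as the second argument, for example `BeautifulSoup(response.text, pick_parser())`.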
Answered By: alecxe

From the summary table of advantages and disadvantages in the docs:

  1. html.parser: BeautifulSoup(markup, "html.parser")

    • Advantages: Batteries included, Decent speed, Lenient (as of Python 2.7.3 and 3.2.2)

    • Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2)

  2. lxml: BeautifulSoup(markup, "lxml")

    • Advantages: Very fast, Lenient

    • Disadvantages: External C dependency

  3. html5lib: BeautifulSoup(markup, "html5lib")

    • Advantages: Extremely lenient, Parses pages the same way a web browser does, Creates valid HTML5

    • Disadvantages: Very slow, External Python dependency
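
The leniency differences are easiest to see by feeding the same broken fragment to each parser and comparing the documents they build. The sketch below is illustrative only: `parse_with_each` is a hypothetical helper, the `<a></p>` fragment is the one the Beautiful Soup documentation uses when discussing parser differences, and any parser that is not installed is simply skipped. Output can vary between library versions, so none is shown here.

```python
try:
    from bs4 import BeautifulSoup, FeatureNotFound
except ImportError:  # bs4 itself is a third-party package
    BeautifulSoup = None

PARSERS = ("html.parser", "lxml", "html5lib")


def parse_with_each(markup, parsers=PARSERS):
    """Map each available parser name to the document it builds from markup."""
    results = {}
    if BeautifulSoup is None:
        return results
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # Raised by Beautiful Soup when the requested parser is not installed.
            continue
    return results


if __name__ == "__main__":
    # "<a></p>" is invalid: the closing </p> has no matching opening tag.
    for name, html in parse_with_each("<a></p>").items():
        print(f"{name:12} {html}")
```

Running this against a sample of the pages you actually scrape is a quick way to check whether the parsers really are interchangeable for your input.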