BeautifulSoup: what's the difference between 'lxml' and 'html.parser' and 'html5lib' parsers?

Question

When using Beautiful Soup, what is the difference between 'lxml', 'html.parser', and 'html5lib'?

When would you use one over the other, and what are the benefits of each? When I tried them they seemed interchangeable, but people here keep correcting me and saying I should be using a different one. I'd like to strengthen my understanding; I've read a couple of posts on here about this, but they barely cover the use cases, if at all.

Example:

soup = BeautifulSoup(response.text, 'lxml')

Asked By: duc hathaway

Answer 1

The key differences are highlighted in the BeautifulSoup documentation:

Differences between parsers

The basic reasons to prefer one parser over the others:

html.parser – built-in – no extra dependencies needed
html5lib – the most lenient – the better choice if the HTML is broken
lxml – the fastest
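
To see why the choice matters for broken HTML, here is a minimal sketch that feeds the same invalid snippet to each parser and prints the tree each one builds. The helper name parse_with_each and the sample markup are my own choices for illustration, not part of Beautiful Soup. Parsers that are not installed are reported instead of raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Invalid markup: a stray closing </p> with no matching opening tag.
BROKEN_HTML = "<a></p>"


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: serialized tree}, with None for parsers that are not installed.

    Hypothetical helper written for this example only.
    """
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is unavailable.
            results[name] = None
    return results


results = parse_with_each(BROKEN_HTML)
for name, tree in results.items():
    print(f"{name:12} {tree if tree is not None else '(not installed)'}")
```

With html.parser the stray </p> is simply dropped and the result is `<a></a>`. According to the Beautiful Soup documentation, lxml additionally wraps the fragment in `<html><body>`, and html5lib builds a full document with a `<head>` and turns the stray tag into an empty `<p></p>` pair, although the exact output may vary between versions. On valid HTML the three parsers generally agree, which is probably why they looked interchangeable.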

Answered By: alecxe

Answer 2

From the documentation's summary table of advantages and disadvantages:

html.parser – BeautifulSoup(markup, "html.parser")
- Advantages: Batteries included, Decent speed, Lenient (as of Python 2.7.3 and 3.2.2)
- Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2)
lxml – BeautifulSoup(markup, "lxml")
- Advantages: Very fast, Lenient
- Disadvantages: External C dependency
html5lib – BeautifulSoup(markup, "html5lib")
- Advantages: Extremely lenient, Parses pages the same way a web browser does, Creates valid HTML5
- Disadvantages: Very slow, External Python dependency
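
Since lxml and html5lib are external dependencies, code that has to run on machines where they may be missing can pick a parser at runtime. The following is only a sketch under that assumption: pick_parser and OPTIONAL_PARSERS are hypothetical names, and the preference order (speed first, built-in last) is one reasonable choice, not a rule from the documentation.

```python
from importlib.util import find_spec

from bs4 import BeautifulSoup

# Beautiful Soup parser name -> third-party module it depends on.
OPTIONAL_PARSERS = {"lxml": "lxml", "html5lib": "html5lib"}


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first preferred parser whose module is installed, else the built-in one."""
    for name in preferred:
        if find_spec(OPTIONAL_PARSERS[name]) is not None:
            return name
    return "html.parser"  # always available, ships with Python


parser = pick_parser()
soup = BeautifulSoup("<p>Hello, <b>world</b></p>", parser)
print(parser, soup.b.get_text())
```

Naming the parser explicitly like this also keeps results reproducible: if you omit the second argument, Beautiful Soup picks whichever parser it considers best among those installed, so the same script can build different trees on different machines.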

Answered By: Vinícius Figueiredo
