BeautifulSoup: what's the difference between 'lxml' and 'html.parser' and 'html5lib' parsers?
Question:
When using Beautiful Soup, what is the difference between 'lxml', 'html.parser' and 'html5lib'?
When would you use one over the other, and what are the benefits of each? When I used each they seemed to be interchangeable, but people here correct me that I should be using a different one. I'd like to strengthen my understanding; I've read a couple of posts on here about this, but they don't go over the uses much, if at all.
Example:
soup = BeautifulSoup(response.text, 'lxml')
Answers:
The key differences are highlighted in the Beautiful Soup documentation.
The basic reasons to prefer one parser over the others:
html.parser – built in, no extra dependencies needed
html5lib – the most lenient, better to use if the HTML is broken
lxml – the fastest
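To see the leniency difference in practice, one option is to feed the same broken fragment to each parser and compare the resulting trees. The sketch below is illustrative only: the helper name `parse_with_each` and the sample markup are assumptions, not something from the question, and any parser that is not installed is skipped by catching `bs4.FeatureNotFound`.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample: an <a> that is never closed, followed by a stray </p>.
BROKEN_MARKUP = "<a></p>"


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: serialized tree} for every parser that is installed."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # lxml and html5lib are optional installs; skip them if missing.
            continue
        results[name] = str(soup)
    return results


if __name__ == "__main__":
    for name, tree in parse_with_each(BROKEN_MARKUP).items():
        print(f"{name:12} {tree}")
```

On a fragment like this, html.parser typically leaves the fragment as it is, while lxml and html5lib wrap it in html and body elements, and html5lib also keeps an empty p element. Exact output can vary with the installed versions, so treat the printed trees as something to inspect rather than rely on.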
From the summary table of advantages and disadvantages in the docs:
- html.parser – BeautifulSoup(markup, "html.parser")
  - Advantages: Batteries included, Decent speed, Lenient (as of Python 2.7.3 and 3.2.2)
  - Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2)
- lxml – BeautifulSoup(markup, "lxml")
  - Advantages: Very fast, Lenient
  - Disadvantages: External C dependency
- html5lib – BeautifulSoup(markup, "html5lib")
  - Advantages: Extremely lenient, Parses pages the same way a web browser does, Creates valid HTML5
  - Disadvantages: Very slow, External Python dependency
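Because lxml and html5lib are external dependencies, code that has to run where they may be missing can try them in order of preference and fall back to the built-in parser. This is a minimal sketch, assuming that kind of fallback is acceptable for your pages; `make_soup` and its `preferred` parameter are hypothetical names, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib")):
    """Parse markup with the first installed parser, else the built-in one.

    Returns (soup, parser_name) so the caller can see which parser was used.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            continue  # not installed; try the next choice
    # html.parser ships with Python, so this call needs no extra install.
    return BeautifulSoup(markup, "html.parser"), "html.parser"


if __name__ == "__main__":
    soup, used = make_soup("<p>Hello, <b>world</b></p>")
    print(used, soup.b.get_text())
```

Since the parsers can build different trees from the same invalid markup, it is worth logging which parser was actually used, or pinning a single one when results need to be reproducible across machines.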