Validate HTML with BeautifulSoup

Question

I use BeautifulSoup 3.2.1 to parse a lot of HTML files translated with eTranslation.

I found that
soup = BeautifulSoup(html_file, "html.parser") sometimes drops a section of my HTML file. This seems to be related to invalid tags or other problems in the HTML.

I also found that soup = BeautifulSoup(html_file, "lxml") works better with this kind of badly written HTML.

Is there a way to detect which HTML file is invalid using BeautifulSoup?

I imagine something like this:

if valid(html_file):
    soup = BeautifulSoup(html_file, "html.parser")
else:
    soup = BeautifulSoup(html_file, "lxml")
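
The valid() helper above is not something BeautifulSoup provides. As a rough illustration of what such a check could look like, here is a minimal sketch built only on the standard library's html.parser module. The names TagBalanceChecker and looks_well_formed are hypothetical, and the check is a heuristic rather than a real validator: it only flags mismatched or unclosed tags, and it will also flag legal HTML that omits optional closing tags such as </p> or </li>.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: records mismatched closing tags."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open (non-void) tags
        self.problems = []   # human-readable descriptions of mismatches

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.problems.append(f"unexpected </{tag}>")


def looks_well_formed(html_text):
    """Return True if every opened tag is closed in the right order."""
    checker = TagBalanceChecker()
    checker.feed(html_text)
    checker.close()
    return not checker.problems and not checker.stack


print(looks_well_formed("<div><p>text</p></div>"))  # balanced markup
print(looks_well_formed("<div><p>text</div>"))      # <p> is never closed
```

Because the check is stricter than the HTML standard, a False result only means "possibly broken". Falling back to the more lenient parser in that case costs nothing if the file turns out to be fine.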

Asked By: GhitaB

Answer 1

I solved it by always using the lxml parser, regardless of whether the file is valid.
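
A minimal sketch of that approach, assuming the bs4 package (Beautiful Soup 4) and lxml are both installed. The function name parse_html and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup  # assumes beautifulsoup4 and lxml are installed


def parse_html(markup):
    """Parse markup with the lenient lxml parser, whatever its quality."""
    return BeautifulSoup(markup, "lxml")


# Sample input with an unclosed <b> tag; lxml repairs the tree while parsing.
soup = parse_html("<p>Hello <b>world</p>")
print(soup.p.get_text())
```

With this there is no need to decide per file which parser to use, at the cost of depending on lxml for every document.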

Answered By: GhitaB
