Validate HTML with BeautifulSoup
Question:
I use BeautifulSoup 3.2.1 to parse a lot of HTML files translated with eTranslation.
I found that
soup = BeautifulSoup(html_file, "html.parser")
sometimes cuts off a section of my HTML file. This seems to be related to invalid tags or other problems in the HTML.
I also found that soup = BeautifulSoup(html_file, "lxml")
works better for this kind of badly written HTML.
Is there a way to detect which HTML file is invalid using BeautifulSoup?
I imagine something like this:
if valid(html_file):
    soup = BeautifulSoup(html_file, "html.parser")
else:
    soup = BeautifulSoup(html_file, "lxml")
Answers:
I solved it by always using the lxml parser.
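As far as I know, BeautifulSoup does not offer a validity check of its own, so a valid() function like the one in the question would have to be written by hand. Below is a minimal sketch of one possible heuristic, using only the standard library. TagBalanceChecker and VOID_TAGS are hypothetical names made up for this example, and valid() here takes the HTML as a string rather than a file object. It only checks that tags are opened and closed in order, so it is not a real HTML validator: it is stricter than the HTML spec (it flags legal omitted end tags such as an unclosed <p> or <li>) and it ignores every other kind of error.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: tracks tags closed out of order or never closed."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open non-void tags
        self.problems = []    # human-readable notes about mismatches

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.problems.append(f"unexpected </{tag}>")


def valid(html_text):
    """Return True if every non-void tag is closed in the order it was opened."""
    checker = TagBalanceChecker()
    checker.feed(html_text)
    checker.close()
    return not checker.problems and not checker.open_tags


# Assumed usage, following the question (needs bs4 and lxml installed):
#   parser_name = "html.parser" if valid(html_text) else "lxml"
#   soup = BeautifulSoup(html_text, parser_name)
```

In practice, always passing "lxml" is simpler, because lxml is lenient with broken markup and there is then no need to classify the files first.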