lxml can't parse web pages: etree.fromstring() raises a different error each time

Question:

I'm trying to parse pages with lxml, and a different error appears each time. I thought the problem was with the website, but it didn't work on Google or Wikipedia either!

Can someone help me? If a different error appears each time, is the problem with my environment or with the program?

I’m running this code:

from lxml import etree

driver.get('https://www.google.com.br/')
data = driver.page_source
tree = etree.fromstring(data)

With Google, this error appears:

  File "src\lxml\etree.pyx", line 3257, in lxml.etree.fromstring
  File "src\lxml\parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
  File "src\lxml\parser.pxi", line 1796, in lxml.etree._parseDoc
  File "src\lxml\parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src\lxml\parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 728, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 657, in lxml.etree._raiseParseError
  File "<string>", line 2
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 2, column 55

With Wikipedia, this error appears:

  File "src\lxml\etree.pyx", line 3257, in lxml.etree.fromstring
  File "src\lxml\parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
  File "src\lxml\parser.pxi", line 1796, in lxml.etree._parseDoc
  File "src\lxml\parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src\lxml\parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 728, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 657, in lxml.etree._raiseParseError
  File "<string>", line 19
lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: link line 19 and head, line 19, column 57

And with the site I actually want to parse, this error appears:

  File "src\lxml\etree.pyx", line 3257, in lxml.etree.fromstring
  File "src\lxml\parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
  File "src\lxml\parser.pxi", line 1796, in lxml.etree._parseDoc
  File "src\lxml\parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src\lxml\parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 728, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 657, in lxml.etree._raiseParseError
  File "<string>", line 5
lxml.etree.XMLSyntaxError: error parsing attribute name, line 5, column 301

If anyone knows an alternative to lxml, I would be grateful for that as well.

Asked By: RaymanSix


Answers:

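As John Gordon mentioned, this is not XML, it's HTML, so you have to parse it as HTML. etree.fromstring() uses a strict XML parser, and real-world pages contain things XML forbids, such as a bare & or an unclosed <link> tag. That is why each site fails at a different spot with a different message.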
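
To see the failure without a browser, here is a minimal sketch. The sample string is made up for illustration (it stands in for driver.page_source); it has an unclosed <link> tag and a bare & in a URL, both common in real pages and both illegal in XML.

```python
from lxml import etree

# Hypothetical sample standing in for driver.page_source:
# an unclosed <link> tag and an unescaped "&" inside an attribute.
sample = (
    '<html><head>'
    '<link rel="stylesheet" href="a.css?x=1&y=2">'
    '</head><body><p>Hi</p></body></html>'
)

xml_ok = True
xml_error = None
try:
    # Strict XML parsing: any well-formedness error is fatal.
    etree.fromstring(sample)
except etree.XMLSyntaxError as exc:
    xml_ok = False
    xml_error = str(exc)

print("parsed as XML:", xml_ok)
print("error:", xml_error)
```

The exact message depends on which problem the parser hits first, which would explain why different sites report different errors.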

Try this:

from lxml import html

driver.get('https://www.google.com.br/')
data = driver.page_source
# lxml.html uses a lenient HTML parser that tolerates unclosed tags and bare "&"
tree = html.fromstring(data)
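
The same approach can be tried without a browser. This is only a sketch: the sample string below is invented for illustration and stands in for driver.page_source. It also shows etree.HTMLParser, which may be handy if you would rather keep calling etree.fromstring.

```python
from lxml import etree, html

# Hypothetical sample standing in for driver.page_source.
sample = (
    '<html><head>'
    '<link rel="stylesheet" href="a.css?x=1&y=2">'
    '</head><body><p>Hi</p></body></html>'
)

# Option 1: lxml.html, as in the answer above.
tree = html.fromstring(sample)
hrefs = tree.xpath('//link/@href')    # attribute values of every <link>
paragraphs = tree.xpath('//p/text()')  # text of every <p>

# Option 2: keep the etree API but pass an HTML parser explicitly.
root = etree.fromstring(sample, parser=etree.HTMLParser())

print(tree.tag, hrefs, paragraphs)
print(root.tag)
```

Either way you get an element tree that supports the usual xpath() queries.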
Answered By: SDAO