lxml can't parse my page: etree.fromstring() raises a different error each time
Question:
I'm trying to use lxml's parse function, and a different error appears each time. I thought the problem was the website, but when I tried Google and Wikipedia it didn't work either!
Can someone help me? If a different error appears each time, is the problem with my environment or with my program?
I'm running this code:
from lxml import etree
# driver is assumed to be a Selenium WebDriver created earlier
driver.get('https://www.google.com.br/')
data = driver.page_source
tree = etree.fromstring(data)
With Google, this error appears:
File "src\lxml\etree.pyx", line 3257, in lxml.etree.fromstring
File "src\lxml\parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
File "src\lxml\parser.pxi", line 1796, in lxml.etree._parseDoc
File "src\lxml\parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
File "src\lxml\parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
File "src\lxml\parser.pxi", line 728, in lxml.etree._handleParseResult
File "src\lxml\parser.pxi", line 657, in lxml.etree._raiseParseError
File "<string>", line 2
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 2, column 55
With Wikipedia, this error appears:
File "src\lxml\etree.pyx", line 3257, in lxml.etree.fromstring
File "src\lxml\parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
File "src\lxml\parser.pxi", line 1796, in lxml.etree._parseDoc
File "src\lxml\parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
File "src\lxml\parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
File "src\lxml\parser.pxi", line 728, in lxml.etree._handleParseResult
File "src\lxml\parser.pxi", line 657, in lxml.etree._raiseParseError
File "<string>", line 19
lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: link line 19 and head, line 19, column 57
And with the site I actually want lxml to work on, this error appears:
File "src\lxml\etree.pyx", line 3257, in lxml.etree.fromstring
File "src\lxml\parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
File "src\lxml\parser.pxi", line 1796, in lxml.etree._parseDoc
File "src\lxml\parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
File "src\lxml\parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
File "src\lxml\parser.pxi", line 728, in lxml.etree._handleParseResult
File "src\lxml\parser.pxi", line 657, in lxml.etree._raiseParseError
File "<string>", line 5
lxml.etree.XMLSyntaxError: error parsing attribute name, line 5, column 301
If anyone knows an alternative to lxml, I would be grateful for that as well.
Answers:
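As John Gordon mentioned, this is not XML, it's HTML, so you have to parse it as HTML. etree.fromstring() uses a strict XML parser, and real-world pages are rarely well-formed XML: a bare & (the Google error), a <link> tag that is never closed (the Wikipedia error), or attribute syntax that XML does not allow (probably the third error) each make it fail. That is why every site gives a different error. The problem is not your environment.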
Try this:
from lxml import html
driver.get('https://www.google.com.br/')
data = driver.page_source
tree = html.fromstring(data)
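To see the difference without opening a browser, here is a minimal sketch that runs both parsers on the same markup. SAMPLE_PAGE and parse_page are made-up names for illustration; the string only stands in for driver.page_source and is not taken from any of the sites above.

```python
from lxml import etree, html

# Hypothetical stand-in for driver.page_source: ordinary HTML that is
# not well-formed XML (bare "&", unclosed <link>/<br>, valueless attribute).
SAMPLE_PAGE = """<html>
<head>
<link rel="stylesheet" href="style.css?a=1&b=2">
<title>Demo</title>
</head>
<body><p>Tom & Jerry<br>second line</p><input disabled></body>
</html>"""


def parse_page(source):
    """Parse page markup with lxml's lenient HTML parser."""
    return html.fromstring(source)


# The strict XML parser rejects the sample, as in the question.
try:
    etree.fromstring(SAMPLE_PAGE)
    xml_error = None
except etree.XMLSyntaxError as exc:
    xml_error = exc

# The HTML parser accepts the same string and repairs it.
tree = parse_page(SAMPLE_PAGE)
title = tree.findtext(".//title")
paragraph = tree.xpath("//p")[0].text_content()

print("XML parser error:", xml_error)
print("Title:", title)
print("Paragraph text:", paragraph)
```

In the original code, the same idea should apply by calling parse_page(driver.page_source). If you would rather stay with the etree API, etree.fromstring(data, etree.HTMLParser()) is another option that uses the same lenient parser.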