"CData section too big" error when parsing XML with lxml

Question:

I’m trying to parse an XML file with lxml’s etree. Parsing the file with

tree = etree.parse(path_to_xml)

gives me this error:

lxml.etree.XMLSyntaxError: CData section too big

I tried removing all the CDATA sections, but to do that properly I would need to parse the file first, which is exactly the step that fails.

Stripping the CDATA sections would do the trick, and I’ve tried using a regex for that, but it’s a risky fix.

I can’t share the file because it’s confidential, but after talking with my colleagues we suspect the error is due to the length of a CDATA section. The XML file is only about 30 MB, so it shouldn’t be memory related. Any ideas? Thanks!

Answers:

This seems to be a limitation of the libxml2 library, which is set to 10 MB by default: see the comment in another forum by Gustavo L Fabro. I know this is a Python question and that thread is not, but both seem to use the same underlying C library.

When you create the parser, there is an option called huge_tree that might help you out: see the lxml XMLParser docs.
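As a minimal sketch of how that option could be used: the helper name parse_large_xml and the sample data are made up for illustration, while etree.XMLParser(huge_tree=True) and etree.parse(source, parser) are the lxml calls involved. Note that, according to the lxml docs, huge_tree disables libxml2's security restrictions, so it is probably best kept for files you trust.

```python
import io

from lxml import etree


def parse_large_xml(source):
    """Parse XML that may contain very long text or CDATA nodes.

    `source` can be a file path or a file-like object.
    (Hypothetical helper, not part of lxml.)
    """
    # huge_tree=True asks libxml2 to lift its default size limits
    # (the XML_PARSE_HUGE option on the C side).
    parser = etree.XMLParser(huge_tree=True)
    return etree.parse(source, parser)


# Small in-memory demonstration; with a real file you would call
# parse_large_xml(path_to_xml) instead.
sample = io.BytesIO(b"<root><![CDATA[some large payload]]></root>")
root = parse_large_xml(sample).getroot()
print(root.text)
```

The parser object is passed as the second argument to etree.parse, so the rest of the code that works with the resulting tree should not need to change.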

You can even see the actual limit in the C library in the Debian archive mailing list: search it for "XML_PARSE_HUGE".

Good luck!

Answered By: Alexandro Nadal