python lxml save question with special character in string

Question:

Updated my question, did not realize it had formatted my text

When I save my XML using lxml, it converts &amp; to &amp;amp;

When I debug and pull that value after adding it, it’s correct, but when it saves, it adds the extra amp; to the XML file. Do I need to do anything specific when saving a string with a special character in it? I tried converting my XML to string format first and then saving, and that gave me the same results.

Example code: The string I’m writing comes from an Excel file. I read it from there and save it. This example skips the Excel part of the code.

from lxml import etree
import os

root = etree.Element('root')
child1 = etree.SubElement(root, 'stuff')
child1.set('example', 'Example text &amp; From excel file')

et = etree.ElementTree(root)
et.write(os.path.join(os.path.curdir, 'output.xml'),
         pretty_print=True)

Here is the output. Instead of saving Example text &amp; From excel file, it saves Example text &amp;amp; From excel file

<root>
  <stuff example="Example text &amp;amp; From excel file"/>
</root>
Asked By: user1904898


Answers:

Assuming you are reading in this way:

root = etree.fromstring("%s" % in_xml, parser=etree.XMLParser(recover=True))

This is a common problem with lxml. The solution is to convert the string to unicode before using it with lxml. To do that, you’ll need to know the encoding; UTF-8 is very often a correct guess if you don’t know it.

# Python 2; on Python 3, use in_xml.decode('utf-8') when in_xml is a bytes object
in_xml_unicode = unicode(in_xml, 'utf-8')
root = etree.fromstring(in_xml_unicode, parser=etree.XMLParser(recover=True))

If this does not solve your problem, check out this post here.

Look at the answer by @XAnguera. He reads the data and replaces & with &amp;. You can try doing the opposite.

What I mean is that you replace &amp; with & before handing the string to lxml.
This is a very cheap way to solve the issue, I know, but it does work.
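
As a rough sketch of that replacement, assuming the value read from the spreadsheet already contains the literal text &amp; (the helper name strip_amp_entity and the variable names are made up for illustration):

```python
def strip_amp_entity(value):
    """Hypothetical helper: turn a pre-escaped '&amp;' back into a bare '&'."""
    return value.replace('&amp;', '&')


# Assumed input: the cell text as it arrives from the spreadsheet
cell_value = 'Example text &amp; From excel file'

clean_value = strip_amp_entity(cell_value)
print(clean_value)  # Example text & From excel file
```

Passing clean_value to child1.set('example', clean_value) should then let the serializer add the single level of escaping itself. Strings that contain no &amp; pass through unchanged.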

Answered By: Drystan

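It’s due to XML escaping.

When the tree is serialized, every & is written as &amp;. A string that already contains &amp; therefore becomes &amp;amp; (the & turns into &amp; and the rest of the string is written as it is).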
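
A small sketch can make the escaping visible. It is not taken from the question; the element names raw and pre_escaped are invented for the demonstration, and the import falls back to the standard library's ElementTree, which escapes & in the same way, if lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Fallback: the standard library escapes '&' the same way
    import xml.etree.ElementTree as etree

root = etree.Element('root')

# Attribute value holding a bare ampersand
raw = etree.SubElement(root, 'raw')
raw.set('example', 'A & B')

# Attribute value that was already escaped before reaching the serializer
pre = etree.SubElement(root, 'pre_escaped')
pre.set('example', 'A &amp; B')

# Serializing escapes every '&' exactly once
serialized = etree.tostring(root)
print(serialized)

# Parsing undoes exactly one level of escaping
parsed = etree.fromstring(serialized)
print(parsed.find('raw').get('example'))
print(parsed.find('pre_escaped').get('example'))
```

The bare ampersand is written as &amp; and read back as &, while the pre-escaped value is written as &amp;amp; and read back as &amp;. In other words, the value stored in the tree is preserved faithfully; the doubled entity only looks wrong because the input was escaped once too often.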

The solution is to unescape the string before you write it into child1.

import os
from lxml import etree
from xml.sax.saxutils import unescape

root = etree.Element('root')
child1 = etree.SubElement(root, 'stuff')

# Unescape the string first so the serializer escapes the '&' only once
child1.set('example', unescape('Example text &amp; From excel file'))

et = etree.ElementTree(root)
et.write(os.path.join(os.path.curdir, 'output.xml'),
         pretty_print=True)

The result will be:

<root>
  <stuff example="Example text &amp; From excel file"/>
</root>
Answered By: Rahul K P