python lxml save question with special character in string
Question:
Updated my question, did not realize it had formatted my text
When I save my XML using lxml, it converts & to &amp;
When I debug and pull that value after adding it, it’s correct, but when it saves, it adds the extra amp; to the XML file. Do I need to do anything specific when saving a string with a special character in it? I tried converting my XML to string format first and then saving, and that gave me the same results.
Example code: The string I’m writing comes from an Excel file. I read it from there and save it. This example skips the Excel part of the code.
from lxml import etree
import os
root = etree.Element('root')
child1 = etree.SubElement(root, 'stuff')
child1.set('example', 'Example text & From excel file')
et = etree.ElementTree(root)
et.write(os.path.join(os.path.curdir, 'output.xml'),
         pretty_print=True)
Here is the output. Instead of saving Example text & From excel file, it saves Example text &amp; From excel file
<root>
<stuff example="Example text &amp; From excel file"/>
</root>
Answers:
Assuming you are reading in this way:
root = etree.fromstring("%s" % in_xml,parser=etree.XMLParser(recover=True))
This is a common issue with lxml. The solution is to convert the string to Unicode before passing it to lxml. To do that, you’ll need to know the encoding, but UTF-8 is very often a correct guess if you don’t happen to know it.
# Python 3 form of Python 2's unicode(in_xml, 'utf-8'); assumes in_xml is bytes
in_xml_unicode = in_xml.decode('utf-8')
root = etree.fromstring(in_xml_unicode, parser=etree.XMLParser(recover=True))
If this does not solve your problem, check out this post here.
Look at the answer by @XAnguera. He reads the data and replaces the & with &&. You can try doing the opposite: replace the && with the &.
This is a very cheap way to solve the issue, I know, but it does work.
It’s due to XML escaping: on output, & is converted to &amp; (only & becomes &amp;; the rest of the string is written as it is).
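To make the escaping step concrete, here is a minimal sketch that serializes the same attribute in memory and parses it back. It assumes `etree.tostring` as a stand-in for writing `output.xml`, and the fallback to the standard library's `xml.etree.ElementTree` is only there so the snippet runs without lxml installed. The variable names are illustrative.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, same API for these calls
    import xml.etree.ElementTree as etree

root = etree.Element('root')
child = etree.SubElement(root, 'stuff')
child.set('example', 'Example text & From excel file')

# In memory, the attribute still holds the raw string with a bare &.
in_memory = child.get('example')

# Serializing escapes & as &amp; so the output is well-formed XML.
serialized = etree.tostring(root).decode('utf-8')

# Parsing the serialized text turns &amp; back into &.
reparsed = etree.fromstring(serialized)
round_tripped = reparsed.find('stuff').get('example')

print(serialized)
print(round_tripped)
```

If the round-tripped value matches the original string, the &amp; in the file is just the on-disk spelling of &, and any XML parser reading the file should hand back the original text.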
The solution is to unescape the string you write into child1.
import os
from lxml import etree
from xml.sax.saxutils import unescape
root = etree.Element('root')
child1 = etree.SubElement(root, 'stuff')
# unescape the string before setting it as the attribute value
child1.set('example', unescape('Example text & From excel file'))
et = etree.ElementTree(root)
et.write(os.path.join(os.path.curdir, 'output.xml'),
         pretty_print=True)
The result will be:
<root>
<stuff example="Example text & From excel file"/>
</root>
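It may be worth checking what `unescape` actually does to each kind of input before relying on it. The sketch below is an illustration, not the answerer's code: it compares a string with a bare & against one that already contains &amp;, then serializes both in memory with `etree.tostring` (an assumption, used here instead of writing a file). The stdlib fallback import is also an assumption so the snippet runs without lxml.

```python
from xml.sax.saxutils import unescape

try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, same API for these calls
    import xml.etree.ElementTree as etree


def serialize_attr(value):
    """Build <root><stuff example=.../></root> and return it as text."""
    root = etree.Element('root')
    etree.SubElement(root, 'stuff').set('example', value)
    return etree.tostring(root).decode('utf-8')


raw = 'Example text & From excel file'
already_escaped = 'Example text &amp; From excel file'

# unescape only rewrites entities; a bare & passes through unchanged.
unescaped_raw = unescape(raw)
unescaped_entity = unescape(already_escaped)

# A literal & is escaped on output whether or not unescape was applied.
out_raw = serialize_attr(unescaped_raw)

# Text that was already escaped gets escaped a second time.
out_double = serialize_attr(already_escaped)

# Unescaping such text first avoids the double escape.
out_fixed = serialize_attr(unescaped_entity)

print(out_raw)
print(out_double)
print(out_fixed)
```

On this reading, `unescape` helps when the incoming data already contains entities such as &amp; and would otherwise be written as &amp;amp;. For a plain &, the serializer still emits &amp;, because a bare & inside an attribute value is not well-formed XML.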