How to prevent lxml from converting '&' character to '&amp;'?

Question:

I need to send the control characters &#x0D; and &#x0A; in my XML file so that the text is displayed correctly in the target system.

For the creation of the XML file I use the lxml library. This is my attempt:

from lxml import etree as et
import lxml.builder

e = lxml.builder.ElementMaker()

xml_doc = e.newOrderRequest(
    e.Orders(
        e.Order(
            e.OrderNumber('12345'),
            e.OrderID('001'),
            e.Articles(
                e.Article(
                    e.ArticleNumber('000111'),
                    e.ArticleName('Logitec Mouse'),
                    e.ArticleDescription('* 4 Buttons&#x0D;&#x0A;* 600 DPI&#x0D;&#x0A;* Bluetooth')
                )
            )
        )
    )
)

tree = et.ElementTree(xml_doc)
tree.write('output.xml', pretty_print=True, xml_declaration=True, encoding="utf-8")

This is the result:

<?xml version='1.0' encoding='UTF-8'?>
<newOrderRequest>
  <Orders>
    <Order>
      <OrderNumber>12345</OrderNumber>
      <OrderID>001</OrderID>
      <Articles>
        <Article>
          <ArticleNumber>000111</ArticleNumber>
          <ArticleName>Logitec Mouse</ArticleName>
          <ArticleDescription>* 4 Buttons&amp;#x0D;&amp;#x0A;* 600 DPI&amp;#x0D;&amp;#x0A;* Bluetooth</ArticleDescription>
        </Article>
      </Articles>
    </Order>
  </Orders>
</newOrderRequest>

This is what I need:

<ArticleDescription>* 4 Buttons&#x0D;&#x0A;* 600 DPI&#x0D;&#x0A;* Bluetooth</ArticleDescription>

Is there a function in the lxml library to turn off the conversion or does anyone know a way to solve this problem? Thanks in advance.

Asked By: Krupniok


Answers:

This is not a Python or lxml issue; it is how XML parsers and serializers work.
If you want a specific character in your text, put that character itself into the string in your programming language. The serializer will convert it into an entity or character reference if required, and the parser will convert it back when reading the document. You cannot turn this off, because doing so would violate the specification.

See https://www.w3.org/TR/REC-xml/#syntax:
The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings " &amp; " and " &lt; " respectively.
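
A minimal sketch of that round trip, assuming lxml is installed. The element name and sample text are illustrative and not taken from the question:

```python
from lxml import etree as et

el = et.Element("ArticleName")
el.text = "Cables & Adapters"  # illustrative text containing a literal ampersand

# The serializer escapes the ampersand because the XML spec requires it.
serialized = et.tostring(el, encoding="unicode")
print(serialized)

# The parser undoes the escaping, so the application sees the original text.
restored = et.fromstring(serialized).text
print(restored)
```

The escaping exists only in the file; the string handed back by the parser matches the one that was put in.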

An exception might be to use a CDATA section, as explained in "What does <![CDATA[]]> in XML mean?"
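
A hedged sketch of the CDATA route with lxml's `CDATA` wrapper. It keeps the ampersands unescaped in the file, but inside CDATA the sequence `&#x0D;` is no longer a character reference, so whether this helps depends on how the target system reads the text. The shortened description string is illustrative:

```python
from lxml import etree as et

el = et.Element("ArticleDescription")
# Inside CDATA nothing is escaped, but nothing is interpreted either:
# "&#x0D;" stays six literal characters rather than a carriage return.
el.text = et.CDATA("* 4 Buttons&#x0D;&#x0A;* 600 DPI")

cdata_xml = et.tostring(el, encoding="unicode")
print(cdata_xml)

# A conforming parser returns the literal characters, not CR/LF.
parsed_text = et.fromstring(cdata_xml).text
print(repr(parsed_text))
```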

Answered By: Queeg

The output of the Python script:

import lxml.etree as et
print(repr(et.fromstring('''<ArticleDescription>* 4 Buttons&#x0D;&#x0A;* 600 DPI&#x0D;&#x0A;* Bluetooth</ArticleDescription>''').text))

…is…

'* 4 Buttons\r\n* 600 DPI\r\n* Bluetooth'

That means that the Python-syntax way to write the XML-syntax string * 4 Buttons&#x0D;&#x0A;* 600 DPI&#x0D;&#x0A;* Bluetooth is '* 4 Buttons\r\n* 600 DPI\r\n* Bluetooth'.

Thus, the relevant line of code should be:

e.ArticleDescription('* 4 Buttons\r\n* 600 DPI\r\n* Bluetooth')

…and if the consumer doesn’t treat the resulting output as exactly identical to <ArticleDescription>* 4 Buttons&#x0D;&#x0A;* 600 DPI&#x0D;&#x0A;* Bluetooth</ArticleDescription>, that consumer is broken.

See https://replit.com/@CharlesDuffy2/ImportantClassicConversion#test.py running your code with the modification suggested above.
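
The same idea as a self-contained check, assuming lxml is installed: build the element with real CR/LF characters, serialize it, and confirm that parsing the result gives the identical string back. The exact spelling of the reference in the file (for example decimal versus hexadecimal) is up to the serializer; a conforming parser yields the same characters either way.

```python
from lxml import etree as et
import lxml.builder

e = lxml.builder.ElementMaker()

# Use actual carriage return and line feed characters, not "&#x0D;&#x0A;".
description = '* 4 Buttons\r\n* 600 DPI\r\n* Bluetooth'
element = e.ArticleDescription(description)

# Serialize, then inspect what actually ends up in the file.
serialized = et.tostring(element)
print(serialized)

# Parsing the output again should return exactly the original text.
round_tripped = et.fromstring(serialized).text
print(repr(round_tripped))
```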

Answered By: Charles Duffy