Pretty printing XML in Python

Question:

What is the best way (or are the various ways) to pretty print XML in Python?

Asked By: Hortitude

||

Answers:

XML pretty print for python looks pretty good for this task. (Appropriately named, too.)

An alternative is to use pyXML, which has a PrettyPrint function.

Answered By: Dan Lew

lxml is recent, updated, and includes a pretty print function

import lxml.etree as etree

x = etree.parse("filename")
print etree.tostring(x, pretty_print=True)

Check out the lxml tutorial:
https://lxml.de/tutorial.html

Answered By: 1729

If you’re using a DOM implementation, each has their own form of pretty-printing built-in:

# minidom
#
document.toprettyxml()

# 4DOM
#
xml.dom.ext.PrettyPrint(document, stream)

# pxdom (or other DOM Level 3 LS-compliant imp)
#
serializer.domConfig.setParameter('format-pretty-print', True)
serializer.writeToString(document)

If you’re using something else without its own pretty-printer — or those pretty-printers don’t quite do it the way you want —  you’d probably have to write or subclass your own serialiser.

Answered By: bobince

I had some problems with minidom’s pretty print. I’d get a UnicodeError whenever I tried pretty-printing a document with characters outside the given encoding, eg if I had a β in a document and I tried doc.toprettyxml(encoding='latin-1'). Here’s my workaround for it:

def toprettyxml(doc, encoding):
    """Return a pretty-printed XML document in a given encoding."""
    unistr = doc.toprettyxml().replace(u'<?xml version="1.0" ?>',
                          u'<?xml version="1.0" encoding="%s"?>' % encoding)
    return unistr.encode(encoding, 'xmlcharrefreplace')
Answered By: giltay
import xml.dom.minidom

dom = xml.dom.minidom.parse(xml_fname) # or xml.dom.minidom.parseString(xml_string)
pretty_xml_as_string = dom.toprettyxml()
Answered By: Ben Noland

Here’s my (hacky?) solution to get around the ugly text node problem.

uglyXml = doc.toprettyxml(indent='  ')

text_re = re.compile('>ns+([^<>s].*?)ns+</', re.DOTALL)    
prettyXml = text_re.sub('>g<1></', uglyXml)

print prettyXml

The above code will produce:

<?xml version="1.0" ?>
<issues>
  <issue>
    <id>1</id>
    <title>Add Visual Studio 2005 and 2008 solution files</title>
    <details>We need Visual Studio 2005/2008 project files for Windows.</details>
  </issue>
</issues>

Instead of this:

<?xml version="1.0" ?>
<issues>
  <issue>
    <id>
      1
    </id>
    <title>
      Add Visual Studio 2005 and 2008 solution files
    </title>
    <details>
      We need Visual Studio 2005/2008 project files for Windows.
    </details>
  </issue>
</issues>

Disclaimer: There are probably some limitations.

Answered By: Nick Bolton

Another solution is to borrow this indent function, for use with the ElementTree library that’s built in to Python since 2.5.
Here’s what that would look like:

from xml.etree import ElementTree

def indent(elem, level=0):
    i = "n" + level*"  "
    j = "n" + (level-1)*"  "
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + "  "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for subelem in elem:
            indent(subelem, level+1)
        if not elem.tail or not elem.tail.strip():
            elem.tail = j
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = j
    return elem        

root = ElementTree.parse('/tmp/xmlfile').getroot()
indent(root)
ElementTree.dump(root)
Answered By: ade

As others pointed out, lxml has a pretty printer built in.

Be aware though that by default it changes CDATA sections to normal text, which can have nasty results.

Here’s a Python function that preserves the input file and only changes the indentation (notice the strip_cdata=False). Furthermore it makes sure the output uses UTF-8 as encoding instead of the default ASCII (notice the encoding='utf-8'):

from lxml import etree

def prettyPrintXml(xmlFilePathToPrettyPrint):
    assert xmlFilePathToPrettyPrint is not None
    parser = etree.XMLParser(resolve_entities=False, strip_cdata=False)
    document = etree.parse(xmlFilePathToPrettyPrint, parser)
    document.write(xmlFilePathToPrettyPrint, pretty_print=True, encoding='utf-8')

Example usage:

prettyPrintXml('some_folder/some_file.xml')
Answered By: roskakori

If you have xmllint you can spawn a subprocess and use it. xmllint --format <file> pretty-prints its input XML to standard output.

Note that this method uses an program external to python, which makes it sort of a hack.

def pretty_print_xml(xml):
    proc = subprocess.Popen(
        ['xmllint', '--format', '/dev/stdin'],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    (output, error_output) = proc.communicate(xml);
    return output

print(pretty_print_xml(data))
Answered By: Russell Silva

I tried to edit "ade"s answer above, but Stack Overflow wouldn’t let me edit after I had initially provided feedback anonymously. This is a less buggy version of the function to pretty-print an ElementTree.

def indent(elem, level=0, more_sibs=False):
    i = "n"
    if level:
        i += (level-1) * '  '
    num_kids = len(elem)
    if num_kids:
        if not elem.text or not elem.text.strip():
            elem.text = i + "  "
            if level:
                elem.text += '  '
        count = 0
        for kid in elem:
            indent(kid, level+1, count < num_kids - 1)
            count += 1
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
            if more_sibs:
                elem.tail += '  '
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i
            if more_sibs:
                elem.tail += '  '
Answered By: Joshua Richardson

I solved this with some lines of code, opening the file, going trough it and adding indentation, then saving it again. I was working with small xml files, and did not want to add dependencies, or more libraries to install for the user. Anyway, here is what I ended up with:

f = open(file_name,'r')
xml = f.read()
f.close()

#Removing old indendations
raw_xml = ''        
for line in xml:
    raw_xml += line

xml = raw_xml

new_xml = ''
indent = '    '
deepness = 0

for i in range((len(xml))):

    new_xml += xml[i]   
    if(i<len(xml)-3):

        simpleSplit = xml[i:(i+2)] == '><'
        advancSplit = xml[i:(i+3)] == '></'        
        end = xml[i:(i+2)] == '/>'    
        start = xml[i] == '<'

        if(advancSplit):
            deepness += -1
            new_xml += 'n' + indent*deepness
            simpleSplit = False
            deepness += -1
        if(simpleSplit):
            new_xml += 'n' + indent*deepness
        if(start):
            deepness += 1
        if(end):
            deepness += -1

f = open(file_name,'w')
f.write(new_xml)
f.close()

It works for me, perhaps someone will have some use of it 🙂

Answered By: Petter TB
from yattag import indent

pretty_string = indent(ugly_string)

It won’t add spaces or newlines inside text nodes, unless you ask for it with:

indent(mystring, indent_text = True)

You can specify what the indentation unit should be and what the newline should look like.

pretty_xml_string = indent(
    ugly_xml_string,
    indentation = '    ',
    newline = 'rn'
)

The doc is on http://www.yattag.org homepage.

Answered By: John Smith Optional

I had this problem and solved it like this:

def write_xml_file (self, file, xml_root_element, xml_declaration=False, pretty_print=False, encoding='unicode', indent='t'):
    pretty_printed_xml = etree.tostring(xml_root_element, xml_declaration=xml_declaration, pretty_print=pretty_print, encoding=encoding)
    if pretty_print: pretty_printed_xml = pretty_printed_xml.replace('  ', indent)
    file.write(pretty_printed_xml)

In my code this method is called like this:

try:
    with open(file_path, 'w') as file:
        file.write('<?xml version="1.0" encoding="utf-8" ?>')

        # create some xml content using etree ...

        xml_parser = XMLParser()
        xml_parser.write_xml_file(file, xml_root, xml_declaration=False, pretty_print=True, encoding='unicode', indent='t')

except IOError:
    print("Error while writing in log file!")

This works only because etree by default uses two spaces to indent, which I don’t find very much emphasizing the indentation and therefore not pretty. I couldn’t ind any setting for etree or parameter for any function to change the standard etree indent. I like how easy it is to use etree, but this was really annoying me.

Answered By: Zelphir Kaltstahl

I wrote a solution to walk through an existing ElementTree and use text/tail to indent it as one typically expects.

def prettify(element, indent='  '):
    queue = [(0, element)]  # (level, element)
    while queue:
        level, element = queue.pop(0)
        children = [(level + 1, child) for child in list(element)]
        if children:
            element.text = 'n' + indent * (level+1)  # for child open
        if queue:
            element.tail = 'n' + indent * queue[0][0]  # for sibling open
        else:
            element.tail = 'n' + indent * (level-1)  # for parent close
        queue[0:0] = children  # prepend so children come before siblings
Answered By: nacitar sevaht

You can use popular external library xmltodict, with unparse and pretty=True you will get best result:

xmltodict.unparse(
    xmltodict.parse(my_xml), full_document=False, pretty=True)

full_document=False against <?xml version="1.0" encoding="UTF-8"?> at the top.

Answered By: Vitaly Zdanevich

You have a few options.

xml.etree.ElementTree.indent()

Batteries included, simple to use, pretty output.

But requires Python 3.9+

import xml.etree.ElementTree as ET

element = ET.XML("<html><body>text</body></html>")
ET.indent(element)
print(ET.tostring(element, encoding='unicode'))

BeautifulSoup.prettify()

BeautifulSoup may be the simplest solution for Python < 3.9.

from bs4 import BeautifulSoup

bs = BeautifulSoup(open(xml_file), 'xml')
pretty_xml = bs.prettify()
print(pretty_xml)

Output:

<?xml version="1.0" encoding="utf-8"?>
<issues>
 <issue>
  <id>
   1
  </id>
  <title>
   Add Visual Studio 2005 and 2008 solution files
  </title>
 </issue>
</issues>

This is my goto answer. The default arguments work as is. But text contents are spread out on separate lines as if they were nested elements.

lxml.etree.parse()

Prettier output but with arguments.

from lxml import etree

x = etree.parse(FILE_NAME)
pretty_xml = etree.tostring(x, pretty_print=True, encoding=str)

Produces:

  <issues>
    <issue>
      <id>1</id>
      <title>Add Visual Studio 2005 and 2008 solution files</title>
      <details>We need Visual Studio 2005/2008 project files for Windows.</details>
    </issue>
  </issues>

This works for me with no issues.


xml.dom.minidom.parse()

No external dependencies but post-processing.

import xml.dom.minidom as md

dom = md.parse(FILE_NAME)     
# To parse string instead use: dom = md.parseString(xml_string)
pretty_xml = dom.toprettyxml()
# remove the weird newline issue:
pretty_xml = os.linesep.join([s for s in pretty_xml.splitlines()
                              if s.strip()])

The output is the same as above, but it’s more code.

Answered By: ChaimG

Take a look at the vkbeautify module.

It is a python version of my very popular javascript/nodejs plugin with the same name. It can pretty-print/minify XML, JSON and CSS text. Input and output can be string/file in any combinations. It is very compact and doesn’t have any dependency.

Examples:

import vkbeautify as vkb

vkb.xml(text)                       
vkb.xml(text, 'path/to/dest/file')  
vkb.xml('path/to/src/file')        
vkb.xml('path/to/src/file', 'path/to/dest/file') 
Answered By: vadimk

An alternative if you don’t want to have to reparse, there is the xmlpp.py library with the get_pprint() function. It worked nice and smoothly for my use cases, without having to reparse to an lxml ElementTree object.

Answered By: gaborous

For converting an entire xml document to a pretty xml document
(ex: assuming you’ve extracted [unzipped] a LibreOffice Writer .odt or .ods file, and you want to convert the ugly "content.xml" file to a pretty one for automated git version control and git difftooling of .odt/.ods files, such as I’m implementing here)

import xml.dom.minidom

file = open("./content.xml", 'r')
xml_string = file.read()
file.close()

parsed_xml = xml.dom.minidom.parseString(xml_string)
pretty_xml_as_string = parsed_xml.toprettyxml()

file = open("./content_new.xml", 'w')
file.write(pretty_xml_as_string)
file.close()

References:

Answered By: Gabriel Staples
from lxml import etree
import xml.dom.minidom as mmd

xml_root = etree.parse(xml_fiel_path, etree.XMLParser())

def print_xml(xml_root):
    plain_xml = etree.tostring(xml_root).decode('utf-8')
    urgly_xml = ''.join(plain_xml .split())
    good_xml = mmd.parseString(urgly_xml)
    print(good_xml.toprettyxml(indent='    ',))

It’s working well for the xml with Chinese!

Answered By: Reed_Xia

You can try this variation…

Install BeautifulSoup and the backend lxml (parser) libraries:

user$ pip3 install lxml bs4

Process your XML document:

from bs4 import BeautifulSoup

with open('/path/to/file.xml', 'r') as doc: 
    for line in doc: 
        print(BeautifulSoup(line, 'lxml-xml').prettify())  
Answered By: NYCeyes

Here’s a Python3 solution that gets rid of the ugly newline issue (tons of whitespace), and it only uses standard libraries unlike most other implementations.

import xml.etree.ElementTree as ET
import xml.dom.minidom
import os

def pretty_print_xml_given_root(root, output_xml):
    """
    Useful for when you are editing xml data on the fly
    """
    xml_string = xml.dom.minidom.parseString(ET.tostring(root)).toprettyxml()
    xml_string = os.linesep.join([s for s in xml_string.splitlines() if s.strip()]) # remove the weird newline issue
    with open(output_xml, "w") as file_out:
        file_out.write(xml_string)

def pretty_print_xml_given_file(input_xml, output_xml):
    """
    Useful for when you want to reformat an already existing xml file
    """
    tree = ET.parse(input_xml)
    root = tree.getroot()
    pretty_print_xml_given_root(root, output_xml)

I found how to fix the common newline issue here.

Answered By: Josh Correia

If for some reason you can’t get your hands on any of the Python modules that other users mentioned, I suggest the following solution for Python 2.7:

import subprocess

def makePretty(filepath):
  cmd = "xmllint --format " + filepath
  prettyXML = subprocess.check_output(cmd, shell = True)
  with open(filepath, "w") as outfile:
    outfile.write(prettyXML)

As far as I know, this solution will work on Unix-based systems that have the xmllint package installed.

Answered By: FriskySaga

As of Python 3.9, ElementTree has an indent() function for pretty-printing XML trees.

See https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.indent.

Sample usage:

import xml.etree.ElementTree as ET

element = ET.XML("<html><body>text</body></html>")
ET.indent(element)
print(ET.tostring(element, encoding='unicode'))

The upside is that it does not require any additional libraries. For more information check https://bugs.python.org/issue14465 and https://github.com/python/cpython/pull/15200

Answered By: o15a3d4l11s2

I found this question while looking for "how to pretty print html"

Using some of the ideas in this thread I adapted the XML solutions to work for XML or HTML:

from xml.dom.minidom import parseString as string_to_dom

def prettify(string, html=True):
    dom = string_to_dom(string)
    ugly = dom.toprettyxml(indent="  ")
    split = list(filter(lambda x: len(x.strip()), ugly.split('n')))
    if html:
        split = split[1:]
    pretty = 'n'.join(split)
    return pretty

def pretty_print(html):
    print(prettify(html))

When used this is what it looks like:

html = """
<div class="foo" id="bar"><p>'IDK!'</p><br/><div class='baz'><div>
<span>Hi</span></div></div><p id='blarg'>Try for 2</p>
<div class='baz'>Oh No!</div></div>
"""

pretty_print(html)

Which returns:

<div class="foo" id="bar">
  <p>'IDK!'</p>
  <br/>
  <div class="baz">
    <div>
      <span>Hi</span>
    </div>
  </div>
  <p id="blarg">Try for 2</p>
  <div class="baz">Oh No!</div>
</div>
Answered By: emehex

Use etree.indent and etree.tostring

import lxml.etree as etree

root = etree.fromstring('<html><head></head><body><h1>Welcome</h1></body></html>')
etree.indent(root, space="  ")
xml_string = etree.tostring(root, pretty_print=True).decode()
print(xml_string)

output

<html>
  <head/>
  <body>
    <h1>Welcome</h1>
  </body>
</html>

Removing namespaces and prefixes

import lxml.etree as etree


def dump_xml(element):
    for item in element.getiterator():
        item.tag = etree.QName(item).localname

    etree.cleanup_namespaces(element)
    etree.indent(element, space="  ")
    result = etree.tostring(element, pretty_print=True).decode()
    return result


root = etree.fromstring('<cs:document ><document>
  <name>hello world</name>
</document>

I found a fast and easy way to nicely format and print an xml file:

import xml.etree.ElementTree as ET

xmlTree = ET.parse('your XML file')
xmlRoot = xmlTree.getroot()
xmlDoc =  ET.tostring(xmlRoot, encoding="unicode")

print(xmlDoc)

Outuput:

<root>
  <child>
    <subchild>.....</subchild>
  </child>
  <child>
    <subchild>.....</subchild>
  </child>
  ...
  ...
  ...
  <child>
    <subchild>.....</subchild>
  </child>
</root>
Answered By: Wuagliono
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.