Using Python Iterparse For Large XML Files
Question:
I need to write a parser in Python that can process some extremely large files (> 2 GB) on a computer without much memory (only 2 GB). I wanted to use iterparse in lxml to do it.
My file is of the format:
<item>
<title>Item 1</title>
<desc>Description 1</desc>
</item>
<item>
<title>Item 2</title>
<desc>Description 2</desc>
</item>
and so far my solution is:
from lxml import etree

context = etree.iterparse(MYFILE, tag='item')
for event, elem in context:
    print(elem.xpath('desc/text()'))
del context
Unfortunately, this solution is still eating up a lot of memory. I think the problem is that after dealing with each item I need to do something to clean up the empty children it leaves behind. Can anyone offer suggestions on what I should do after processing my data to clean up properly?
Answers:
Why don’t you use the callback approach of SAX?
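For reference, here is a minimal sketch of that SAX callback approach using the stdlib xml.sax module. The two-item document is made up to match the question's format (wrapped in a root element, which well-formed XML requires); the handler streams through the file and never builds a tree:

```python
import io
import xml.sax

class ItemHandler(xml.sax.ContentHandler):
    """Collect the text of each <desc> element without building a tree."""
    def __init__(self):
        super().__init__()
        self.in_desc = False
        self.buffer = []
        self.descriptions = []

    def startElement(self, name, attrs):
        if name == 'desc':
            self.in_desc = True
            self.buffer = []

    def characters(self, content):
        # may fire several times for one text node, so accumulate
        if self.in_desc:
            self.buffer.append(content)

    def endElement(self, name):
        if name == 'desc':
            self.in_desc = False
            self.descriptions.append(''.join(self.buffer))

xml_doc = b"""<root>
<item><title>Item 1</title><desc>Description 1</desc></item>
<item><title>Item 2</title><desc>Description 2</desc></item>
</root>"""

handler = ItemHandler()
xml.sax.parse(io.BytesIO(xml_doc), handler)
print(handler.descriptions)  # ['Description 1', 'Description 2']
```

In a real run you would pass the file path (or an open file object) to xml.sax.parse instead of the in-memory BytesIO.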
Try Liza Daly’s fast_iter. After processing an element, elem, it calls elem.clear() to remove descendants and also removes preceding siblings.
def fast_iter(context, func, *args, **kwargs):
    """
    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

def process_element(elem):
    print(elem.xpath('desc/text()'))

context = etree.iterparse(MYFILE, tag='item')
fast_iter(context, process_element)
Daly’s article is an excellent read, especially if you are processing large XML files.
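To see fast_iter end to end, here is a self-contained run (assuming lxml is installed and the question's items are wrapped in a root element; results are collected in a list rather than printed so they can be inspected afterwards):

```python
import io
from lxml import etree

def fast_iter(context, func, *args, **kwargs):
    """Process each matched element, then free it and its preceding siblings."""
    for event, elem in context:
        func(elem, *args, **kwargs)
        elem.clear()
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

descriptions = []

def process_element(elem):
    # note the tag in the sample document is 'desc', not 'description'
    descriptions.extend(elem.xpath('desc/text()'))

xml_doc = b"""<root>
<item><title>Item 1</title><desc>Description 1</desc></item>
<item><title>Item 2</title><desc>Description 2</desc></item>
</root>"""

fast_iter(etree.iterparse(io.BytesIO(xml_doc), tag='item'), process_element)
print(descriptions)  # ['Description 1', 'Description 2']
```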
Edit: The fast_iter posted above is a modified version of Daly’s fast_iter. After processing an element, it is more aggressive about removing other elements that are no longer needed.

The script below shows the difference in behavior. Note in particular that orig_fast_iter does not delete the A1 element, while mod_fast_iter does delete it, thus saving more memory.
import lxml.etree as ET
import textwrap
import io

def setup_ABC():
    content = textwrap.dedent('''
        <root>
          <A1>
            <B1></B1>
            <C>1<D1></D1></C>
            <E1></E1>
          </A1>
          <A2>
            <B2></B2>
            <C>2<D></D></C>
            <E2></E2>
          </A2>
        </root>
        ''')
    return content

def study_fast_iter():
    def orig_fast_iter(context, func, *args, **kwargs):
        for event, elem in context:
            print('Processing {e}'.format(e=ET.tostring(elem)))
            func(elem, *args, **kwargs)
            print('Clearing {e}'.format(e=ET.tostring(elem)))
            elem.clear()
            while elem.getprevious() is not None:
                print('Deleting {p}'.format(
                    p=(elem.getparent()[0]).tag))
                del elem.getparent()[0]
        del context

    def mod_fast_iter(context, func, *args, **kwargs):
        """
        http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
        Author: Liza Daly
        See also http://effbot.org/zone/element-iterparse.htm
        """
        for event, elem in context:
            print('Processing {e}'.format(e=ET.tostring(elem)))
            func(elem, *args, **kwargs)
            # It's safe to call clear() here because no descendants will be
            # accessed
            print('Clearing {e}'.format(e=ET.tostring(elem)))
            elem.clear()
            # Also eliminate now-empty references from the root node to elem
            for ancestor in elem.xpath('ancestor-or-self::*'):
                print('Checking ancestor: {a}'.format(a=ancestor.tag))
                while ancestor.getprevious() is not None:
                    print(
                        'Deleting {p}'.format(p=(ancestor.getparent()[0]).tag))
                    del ancestor.getparent()[0]
        del context

    content = setup_ABC().encode('utf-8')
    context = ET.iterparse(io.BytesIO(content), events=('end',), tag='C')
    orig_fast_iter(context, lambda elem: None)
    # Processing <C>1<D1/></C>
    # Clearing <C>1<D1/></C>
    # Deleting B1
    # Processing <C>2<D/></C>
    # Clearing <C>2<D/></C>
    # Deleting B2

    print('-' * 80)
    # The improved fast_iter deletes A1. The original fast_iter does not.
    content = setup_ABC().encode('utf-8')
    context = ET.iterparse(io.BytesIO(content), events=('end',), tag='C')
    mod_fast_iter(context, lambda elem: None)
    # Processing <C>1<D1/></C>
    # Clearing <C>1<D1/></C>
    # Checking ancestor: root
    # Checking ancestor: A1
    # Checking ancestor: C
    # Deleting B1
    # Processing <C>2<D/></C>
    # Clearing <C>2<D/></C>
    # Checking ancestor: root
    # Checking ancestor: A2
    # Deleting A1
    # Checking ancestor: C
    # Deleting B2

study_fast_iter()
iterparse() lets you do stuff while building the tree; that means that unless you remove what you no longer need, you’ll still end up with the whole tree in the end.
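A quick stdlib sketch of that point: iterate over a toy document without ever calling clear(), and the fully built tree is still there when the loop finishes:

```python
import io
from xml.etree.ElementTree import iterparse

xml_doc = b"<root><a/><b/><c/></root>"

last = None
for event, elem in iterparse(io.BytesIO(xml_doc), events=("end",)):
    last = elem  # "process" each element, but never clear()

# the last "end" event is the root itself, still holding every child
print([child.tag for child in last])  # ['a', 'b', 'c']
```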
For more information, read this piece by the author of the original ElementTree implementation (it also applies to lxml):
Note that iterparse still builds a tree, just like parse, but you can safely rearrange or remove parts of the tree while parsing. For example, to parse large files, you can get rid of elements as soon as you’ve processed them:
for event, elem in iterparse(source):
    if elem.tag == "record":
        ... process record elements ...
        elem.clear()
The above pattern has one drawback; it does not clear the root element, so you will end up with a single element with lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events, and save a reference to the first element in a variable:
# get an iterable
context = iterparse(source, events=("start", "end"))

# turn it into an iterator
context = iter(context)

# get the root element
event, root = next(context)

for event, elem in context:
    if event == "end" and elem.tag == "record":
        ... process record elements ...
        root.clear()
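A runnable sketch of that pattern with the stdlib ElementTree in Python 3 (where the old context.next() becomes next(context); the two-record document is made up for illustration):

```python
import io
from xml.etree.ElementTree import iterparse

xml_doc = b"""<log>
<record><id>1</id></record>
<record><id>2</id></record>
</log>"""

context = iterparse(io.BytesIO(xml_doc), events=("start", "end"))
context = iter(context)

# the very first event is the start of the root element
event, root = next(context)

ids = []
for event, elem in context:
    if event == "end" and elem.tag == "record":
        ids.append(elem.findtext("id"))
        root.clear()  # drop processed records from the root

print(ids)  # ['1', '2']
```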
So this is a question of incremental parsing. This link can give you a detailed answer; for a summarized answer, you can refer to the above.
The only problem with the root.clear() method is that it returns NoneType. This means you can’t, for instance, edit the data you parse with string methods like replace() or title(). That said, this is an optimal method to use if you’re just parsing the data as-is.
In my experience, iterparse with or without element.clear() (see F. Lundh and L. Daly) cannot always cope with very large XML files: it goes well for some time, then suddenly the memory consumption goes through the roof and a memory error occurs or the system crashes. If you encounter the same problem, maybe you can use the same solution: the expat parser. See also F. Lundh or the following example using the OP’s XML snippet (plus two umlauts for checking that there are no encoding issues):
import xml.parsers.expat
from collections import deque

def iter_xml(inpath: str, outpath: str) -> None:

    def handle_cdata_end():
        nonlocal in_cdata
        in_cdata = False

    def handle_cdata_start():
        nonlocal in_cdata
        in_cdata = True

    def handle_data(data: str):
        nonlocal in_cdata
        if not in_cdata and open_tags and open_tags[-1] == 'desc':
            data = data.replace('\\', '\\\\').replace('\n', '\\n')
            outfile.write(data + '\n')

    def handle_endtag(tag: str):
        while open_tags:
            open_tag = open_tags.pop()
            if open_tag == tag:
                break

    def handle_starttag(tag: str, attrs: 'Dict[str, str]'):
        open_tags.append(tag)

    open_tags = deque()
    in_cdata = False
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = handle_data
    parser.EndCdataSectionHandler = handle_cdata_end
    parser.EndElementHandler = handle_endtag
    parser.StartCdataSectionHandler = handle_cdata_start
    parser.StartElementHandler = handle_starttag
    with open(inpath, 'rb') as infile:
        with open(outpath, 'w', encoding='utf-8') as outfile:
            parser.ParseFile(infile)

iter_xml('input.xml', 'output.txt')
input.xml:
<root>
<item>
<title>Item 1</title>
<desc>Description 1ä</desc>
</item>
<item>
<title>Item 2</title>
<desc>Description 2ü</desc>
</item>
</root>
output.txt:
Description 1ä
Description 2ü