What is the fastest way to extract content from an XML document using lxml?

Question:

I’m using lxml to extract information from a bunch of XML files, and I’m wondering whether my approach to this task is the most efficient. Right now I use the xpath() method in lxml to identify the specific targets and then use different lxml methods to extract the information.

As I noted in an earlier question (Processing of XML files excruciatingly slow with LXML Python), using etree.parse(file) or etree.parse(file).getroot() becomes very slow once the files reach a certain size. They don’t need to be huge: a 12 MB XML file is already quite slow.

What I’m wondering now is whether there is a faster alternative. The lxml documentation says that using the XPath class might be faster than the xpath() method. The problem I’m having is that the XPath class works with Element objects, not with the ElementTree objects that etree.parse() produces.
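For what it’s worth, the XPath-class approach the documentation describes can be combined with etree.parse() by calling the compiled expression on the tree’s root element. A minimal sketch with a toy document (the document contents are illustrative):

```python
from io import BytesIO
from lxml import etree

# Toy stand-in for one of the real files; contents are illustrative.
xml = b'<root><tok lemma="el" xpos="L3MSA">lo</tok><tok lemma="hola">hola</tok></root>'
tree = etree.parse(BytesIO(xml))

# Compile the expression once with the XPath class, then call the compiled
# object on an Element -- e.g. the result of getroot().
find_lo = etree.XPath('//tok[text()="lo"]')
matches = find_lo(tree.getroot())
print([m.get('lemma') for m in matches])
```

Compiling once and reusing the callable avoids re-parsing the expression string on every evaluation.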

All I need is a faster alternative to what I’m doing now, which is basically some variation of the script below. This is just one example of the many scripts of the same kind I use to extract information from the relevant XML files. In case you suspect it is the regular expressions that are responsible for the slowness: I’ve run tests using the plain XPath root_element.xpath('//tok[text()="lo"]') with no regex, and processing the 20 to 30 MB files is only slightly faster. Whatever I do with these files, if it involves a for loop that evaluates an XPath expression and does something, it takes far longer than one would expect on the latest Python and a Mac with an M1 Max chip. On my older laptop the same job takes three days!

import os
import csv

from lxml import etree as et

XMLDIR = "/path_to_dir_with_xml_files"
myCSV_FILE = "/path_to_some_csv_file.csv"

ext = ".xml"


def xml_extract(root_element):

    for el in root_element.xpath('//tok[re:match(., "^[EeLl][LlOoAa][Ss]*$") and not(starts-with(@xpos, "D"))]',
        namespaces={"re": "http://exslt.org/regular-expressions"}): 

        target = el.text
        # allRelevantElements = el.xpath('preceding::tok[position() >= 1 and not(position() > 6)]/following::tok[position() >= 1 and not(position() > 6)]')
        RelevantPrecedingElements = el.xpath(
            "preceding::tok[position() >= 1 and not(position() > 6)]"
        )
        RelevantFollowingElements = el.xpath(
            "following::tok[position() >= 1 and not(position() > 6)]"
        )
        context_list = []

        for elem in RelevantPrecedingElements:
            elem_text = "".join(elem.itertext())
            assert elem_text is not None
            context_list.append(elem_text)

        # adjective = '<' + str(el.text) + '>'
        target = f"<{el.text}>"
        print(target)
        context_list.append(target)

        following_context = []
        for elem in RelevantFollowingElements:
            elem_text = "".join(elem.itertext())
            assert elem_text is not None
            following_context.append(elem_text)

        foll = el.xpath('following::tok[1]')
        prec = el.xpath('preceding::tok[1]')
        foll1 = foll[0] if foll else None
        prec1 = prec[0] if prec else None

        lema_fol = foll1.get('lemma') if foll1 is not None else None
        lema_prec = prec1.get('lemma') if prec1 is not None else None
        xpos_fol = foll1.get('xpos') if foll1 is not None else None
        xpos_prec = prec1.get('xpos') if prec1 is not None else None
        form_fol = foll1.text if foll1 is not None else None
        form_prec = prec1.text if prec1 is not None else None

        context = " ".join(context_list)
        print(f"Context is: {context}")


        llista = [
            context,
            lema_prec,
            xpos_prec,
            form_prec,
            target,
            lema_fol,
            xpos_fol,
            form_fol,
        ]

        writer = csv.writer(csv_file, delimiter=";")
        writer.writerow(llista)

with open(myCSV_FILE, "a+", encoding="UTF8", newline="") as csv_file:

    for root, dirs, files in os.walk(XMLDIR):

        for file in files:
            if file.endswith(ext):
                file_path = os.path.join(root, file)  # join with the walked subdirectory, not only XMLDIR
                file_root = et.parse(file_path).getroot()
                doc = file
                xml_extract(file_root)

Here’s an example of a fragment of an XML document containing a match for the XPath expression I’m using. The function xml_extract is called on this match, and the different pieces of information are correctly extracted and stored in the CSV file. This works fine and does what I want, but it is way too slow.

<tok id="w-6387" ord="24" lemma="per" xpos="SPS00">per</tok>
<tok id="w-6388" ord="25" lemma="algun" xpos="DI0FP0">algunes</tok>
<tok id="w-6389" ord="26" lemma="franquesa" xpos="NCFP000">franqueses</tok>
<tok id="w-6390" nform="el" ord="27" lemma="el" xpos="L3MSA">lo</tok>
<tok id="w-6391" ord="28" lemma="haver" xpos="VMIP1S0">hac</tok>

EDIT:

To give some additional, relevant information that might help those trying to help me: the preceding XML content is rather straightforward, but the structure of the documents can get complicated at times. I’m doing a study on medieval texts, and the XML tags in these texts can contain different kinds of information. The ‘tok’ tags contain the linguistic annotations I’m interested in. In normal circumstances the XML looks like the preceding sample. In some cases, however, the editors included other tags with metadata about the manuscripts (e.g. whether there was a modification or deletion by a scribe, whether there is a new section or a new page, the title of a section, etc.). This should give you a sense of what can be found, and perhaps help you understand why I’m using the approach I’m using.

Most of the metadata is not relevant to me at this stage. What is relevant is the information contained in the ‘dtok’ tags. These are children of ‘tok’, used whenever contracted forms have to be decomposed into independent words. This system allows contractions to be displayed as single words while still providing linguistic information about their components. The tagging was done automatically, but it is full of errors. One of my goals in extracting this information is to detect patterns that might help us improve the linguistic annotation in a semi-automated way.

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE document SYSTEM "estcorpus.dtd">
<TEI title="Full title" name="Doc_I44">
  <!--this file comes from the Stand-by folder: it needs to be checked because it has inaccurate xml tags-->
  <header>
    <filiation type="obra">Book title</filiation>
    <filiation type="autor">Author name</filiation>
    <filiation type="data">Segle XVIIa</filiation>
    <filiation type="tipologia">Letters</filiation>
    <filiation type="dialecte">Oc:V</filiation>
  </header>
  <text section="logic" lang="català" analyse="norest">
    <pb n="1r" type="folio" id="e-1" />
    <space />
    <space />
    <mark name="empty line" />
    <add>
      <tok form="IX" id="w-384" ord="1" lemma="IX" xpos="Z">IX</tok>
    </add>
    <mark name="lang=Latin" />
    <tok id="w-385" ord="2" lemma="morir" xpos="TMMS">Mort</tok>
    <tok id="w-386" ord="3" lemma="de" xpos="SPC00">de</tok>
    <tok id="w-387" ord="4" lemma="sant" xpos="NCMS000">sent</tok>
    <tok id="w-388" ord="5" lemma="Vicent" xpos="NP00000">Vicent</tok>
    <tok id="w-389" ord="6" lemma="Ferrer" xpos="NPCS00">Ferrer</tok>
    <tok id="w-99769" ord="23" xpos="CC" lemma="i">e</tok>
    <tok id="w-99770" ord="24" lemma="jo" xpos="PP1CSN00">jo</tok>
    <tok id="w-99771">dar-los 
    <dtok form="dar" id="d-99771-1" ord="25" lemma="dar" xpos="VMN0000" />
    <dtok form="los" id="d-99771-2" ord="26" lemma="els" xpos="L3CP0" /></tok>
    <tok id="w-99772" ord="27" lemma="haver" xpos="V0IF3S0">hé</tok>
    <tok id="w-99773" ord="28" lemma="diner" xpos="NCMP000">diners</tok>
    <space />
    <mark name="/lang" />
    <foreign name="Latin">
      <tok id="w-390" ord="7" lemma="any" xpos="CC">Annum</tok>
    </foreign>
  </text>
</TEI>
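As an aside, the ‘dtok’ decomposition in a sample like the one above can be inspected with a few lines of lxml (the trimmed document here is illustrative):

```python
from lxml import etree

# Trimmed stand-in for the sample above: one contracted form with its dtok children.
xml = b'''<text>
  <tok id="w-99771">dar-los
    <dtok form="dar" id="d-99771-1" ord="25" lemma="dar" xpos="VMN0000"/>
    <dtok form="los" id="d-99771-2" ord="26" lemma="els" xpos="L3CP0"/></tok>
</text>'''
root = etree.fromstring(xml)

# Each dtok carries the annotation for one component of the contraction.
components = [(d.get('form'), d.get('lemma'), d.get('xpos')) for d in root.iter('dtok')]
print(components)
```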

Is this the only way to go through an XML file using lxml, or is there a faster way? Right now it takes 21 minutes to go through a 30 MB file and collect the information associated with the specific XPath expression. I’m using Python 3.11 and a pretty powerful computer, and I can’t help thinking there must be a more efficient way to do what I’m doing. I have around 400 files in the directory, so it takes forever every time I have to go through them and do something.

EDIT 2:

After following the recommendation to use compiled XPath expressions, I ran a test using the revised code provided by @Martin Honnen; here are the results. I haven’t tried the other recommended alternatives yet and will report back when I do.

File sizes:
A-01.xml : 13.2 MB
A-02.xml : 31.4 MB
A-03.xml : 7.7 MB
A-04.xml : 11.6 MB
I-44.xml : 22.9 MB


Original run:

File:      seconds

A-01.xml ➝ 56.845274686813354
A-02.xml ➝ 1281.4102880954742
A-03.xml ➝ 80.60795021057129
A-04.xml ➝ 149.65892505645752
I-44.xml ➝ 983.7257928848267


With compiled XPath expressions:

File:      seconds

A-01.xml ➝ 59.663841009140015
A-02.xml ➝ 1533.5482828617096
A-03.xml ➝ 78.68556118011475
A-04.xml ➝ 149.15855598449707
I-44.xml ➝ 876.2536578178406
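Per-file timings like the ones above can be collected with a small harness such as the following sketch (the file name and the stand-in processing step are placeholders, not the script actually used):

```python
import time

def time_run(process, files):
    """Return {path: seconds} for running `process` on each file path."""
    results = {}
    for path in files:
        start = time.perf_counter()
        process(path)
        results[path] = time.perf_counter() - start
    return results

# Trivial stand-in for the real extraction step, just to show the shape:
timings = time_run(lambda path: None, ["A-01.xml"])
print(timings)
```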
Asked By: jfontana


Answers:

Reworked my answer according to your detailed explanation:

Could you give XMLPullParser() a try? It should be very fast and won’t block your machine. With very large files you can decide which part of the input to feed to the parser. In my example I load the whole XML at once, but that need not be the case in real work.

import xml.etree.ElementTree as ET
import pandas as pd

with open('jfontana.xml', 'r') as input_file:
    xml = input_file.read()

parser = ET.XMLPullParser(['end'])
parser.feed(xml)

data = []
for event, elem in parser.read_events():
    #print(elem)
    if elem.tag == 'dtok':
        #print(elem.tag, elem.text, elem.attrib)
        data.append(elem.attrib)
        
df = pd.DataFrame.from_dict(data)
df.to_csv("jfontana.csv")
print(df)

Output:

  form         id ord lemma     xpos
0  dar  d-99771-1  25   dar  VMN0000
1  los  d-99771-2  26   els    L3CP0
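The incremental feeding mentioned above (rather than loading the whole file) can be sketched like this; the in-memory source and the chunk size are illustrative:

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for a large file on disk.
source = io.BytesIO(b'<root><dtok form="dar"/><dtok form="los"/></root>')

parser = ET.XMLPullParser(['end'])
forms = []
# Feed the parser one small block at a time instead of the whole document.
while chunk := source.read(16):
    parser.feed(chunk)
    for event, elem in parser.read_events():
        if elem.tag == 'dtok':
            forms.append(elem.get('form'))
print(forms)
```

Because read_events() is drained after each feed(), memory use stays bounded by the chunk size plus the still-open elements.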

Or, if you are interested in all tok, dtok and text:

import xml.etree.ElementTree as ET
import pandas as pd


with open('jfontana.xml', 'r') as input_file:
    xml = input_file.read()

parser = ET.XMLPullParser(['end'])
parser.feed(xml)

data = []
for event, elem in parser.read_events():
    #print(elem)
    if event =='end' and 'tok' in elem.tag:
        #print(elem.tag, elem.text, elem.attrib)
        dict_row = dict(elem.attrib)  # copy, so adding 'text' does not mutate the element
        dict_row['text'] = elem.text
        data.append(dict_row)

     
df = pd.DataFrame.from_dict(data)
df = df.replace('\n', ' ', regex=True)  # strip newlines for cleaner CSV
df.to_csv("jfontana.csv")
print(df)

Output:

   form         id  ord   lemma      xpos           text
0    IX      w-384    1      IX         Z             IX
1   NaN      w-385    2   morir      TMMS           Mort
2   NaN      w-386    3      de     SPC00             de
3   NaN      w-387    4    sant   NCMS000           sent
4   NaN      w-388    5  Vicent   NP00000         Vicent
5   NaN      w-389    6  Ferrer    NPCS00         Ferrer
6   NaN    w-99769   23       i        CC              e
7   NaN    w-99770   24      jo  PP1CSN00             jo
8   dar  d-99771-1   25     dar   VMN0000           None
9   los  d-99771-2   26     els     L3CP0           None
10  NaN    w-99771  NaN     NaN       NaN  dar-los      
11  NaN    w-99772   27   haver   V0IF3S0             hé
12  NaN    w-99773   28   diner   NCMP000         diners
13  NaN      w-390    7     any        CC          Annum
Answered By: Hermann12

As an alternative, I have now also tried to implement most of the comparisons on Python lists rather than in lxml XPath, roughly porting what I tried in XSLT 3 to Python 3: first select all tok elements with a single XPath, then work on the resulting list to find the index of each regex-matching tok and locate its preceding and/or following toks in that list.

import os
import csv
import re

from lxml import etree as et

XMLDIR = "original-samples"
myCSV_FILE = "lxml-single-xpath-py-list-original-samples.csv"

ext = ".xml"

tok_path = et.XPath('//tok')

def xml_extract(root_element):

    all_toks = tok_path(root_element)

    matching_toks = filter(
        lambda tok: re.match(r'^[EeLl][LlOoAa][Ss]*$', "".join(tok.itertext())) is not None
        and not (tok.get('xpos') or '').startswith('D'),  # some toks carry no xpos attribute
        all_toks,
    )

    for el in matching_toks: 

        target = "".join(el.itertext())
        pos = all_toks.index(el)
        
        RelevantPrecedingElements = all_toks[max(pos - 6, 0):pos]

        prec1 = RelevantPrecedingElements[-1] if RelevantPrecedingElements else None
        foll1 = all_toks[pos + 1] if pos + 1 < len(all_toks) else None

        context_list = []

        for elem in RelevantPrecedingElements:
            elem_text = "".join(elem.itertext())
            assert elem_text is not None
            context_list.append(elem_text)

        # adjective = '<' + str(el.text) + '>'
        target = f"<{target}>"
        print(target)
        context_list.append(target)


        lema_fol = foll1.get('lemma') if foll1 is not None else None
        lema_prec = prec1.get('lemma') if prec1 is not None else None
        xpos_fol = foll1.get('xpos') if foll1 is not None else None
        xpos_prec = prec1.get('xpos') if prec1 is not None else None
        form_fol = foll1.text if foll1 is not None else None
        form_prec = prec1.text if prec1 is not None else None

        context = " ".join(context_list)
        print(f"Context is: {context}")


        llista = [
            context,
            lema_prec,
            xpos_prec,
            form_prec,
            target,
            lema_fol,
            xpos_fol,
            form_fol,
        ]

        writer = csv.writer(csv_file, delimiter=";")
        writer.writerow(llista)

with open(myCSV_FILE, "a+", encoding="UTF8", newline="") as csv_file:

    for root, dirs, files in os.walk(XMLDIR):
        for file in files:
            if file.endswith(ext):
                file_path = os.path.join(root, file)  # join with the walked subdirectory, not only XMLDIR
                file_root = et.parse(file_path).getroot()
                doc = file
                xml_extract(file_root)
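A side note on the list-based lookup above: all_toks.index(el) rescans the list for every match, which is O(n) per lookup. A precomputed position map gives O(1) lookups, and lxml elements are hashable, so they can serve as dict keys. A sketch (the tiny document is illustrative):

```python
from lxml import etree

root = etree.fromstring(b'<r><tok>a</tok><tok>b</tok><tok>c</tok></r>')
all_toks = root.findall('.//tok')

# Build the element -> index map once instead of calling list.index per match.
pos_of = {tok: i for i, tok in enumerate(all_toks)}
third = all_toks[2]
print(pos_of[third])
```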

As an alternative, it would be interesting to see, first for a single file, how SaxonC (https://saxonica.com/saxon-c/1199/) and XSLT 3 perform with an XSLT stylesheet like the one below (for testing, however, I had to comment out the <!DOCTYPE ..> declarations in all samples, as the DTD referenced there was not provided):

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="3.0"
  exclude-result-prefixes="#all">

  <xsl:output method="text"/>

  <xsl:template match="/" name="xsl:initial-template">
    <xsl:variable name="toks" select="//tok"/>
    <xsl:variable name="toks-id" select="$toks/generate-id()"/>
    <xsl:for-each select="$toks[matches(., '^[EeLl][LlOoAa][Ss]*$') and not(starts-with(@xpos, 'D'))]">
      <xsl:variable name="pos" select="index-of($toks-id, generate-id())"/>
      <xsl:variable name="target" select="'&lt;' || . || '>'"/>
      <xsl:variable name="prec-tok" select="$toks[$pos - 1]"/>
      <xsl:variable name="foll-tok" select="$toks[$pos + 1]"/>
      <xsl:value-of 
        select="let $s := string-join(($toks[position() = ($pos - 6) to ($pos - 1)], $target), ' ') return if (contains($s, '&quot;')) then '&quot;' || replace($s, '&quot;', '&quot;&quot;') || '&quot;' else $s, 
                $prec-tok/@lemma => string(),
                $prec-tok/@xpos => string(),
                $prec-tok => string(),
                $target,
                $foll-tok/@lemma => string(),
                $foll-tok/@xpos => string(),
                $foll-tok => string()" 
                separator=";"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
  
</xsl:stylesheet>

That last sample could be run with SaxonC’s Python API using e.g.

from pathlib import Path

from saxonc import *

with PySaxonProcessor(license=True) as proc:
    print(proc.version)

    proc.set_configuration_property('http://saxon.sf.net/feature/validation', 'off')

    proc.set_cwd('.')
    
    xslt30_processor = proc.new_xslt30_processor()

    xslt30_executable = xslt30_processor.compile_stylesheet(stylesheet_file = 'xslt3-original-samples-to-csv.xsl')

    if xslt30_processor.exception_occurred:
        print(xslt30_processor.error_message)
    else:
        
        xslt30_executable.set_base_output_uri(Path('.', 'saxonc-call-template-result.xml').absolute().as_uri())

        xslt30_executable.call_template_returning_value(template_name = None)

        if xslt30_executable.exception_occurred:
            print(xslt30_executable.error_message)
Answered By: Martin Honnen