Get NA for empty slots in lxml xpath()

Question:

I have a large XML file, of which I am providing a sample here:

    <?xml version="1.0" encoding="UTF-8"?>
    <hmdb )
    csvfile = open(file, 'w')
    fieldnames = ['normal_concentration_spec',
                  'normal_concentration_conc']

    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for event, elem in context:
        try:
            tl = elem.xpath('hmdb:normal_concentrations/hmdb:concentration/hmdb:biospecimen/text()', namespaces=ns)
            normal_concentration_spec = '; '.join(str(e) for e in tl)
        except:
            normal_concentration_spec = 'NA'
        try:
            tl = elem.xpath('hmdb:normal_concentrations/hmdb:concentration/hmdb:concentration_value/text()', namespaces=ns)
            normal_concentration_conc = '; '.join(str(e) for e in tl)
        except:
            normal_concentration_conc = 'NA'


        writer.writerow({'normal_concentration_spec': normal_concentration_spec,
                        'normal_concentration_conc': normal_concentration_conc})

        elem.clear()
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    # Clean up only after the whole file has been processed (outside the loop)
    csvfile.close()
    del context

hmdbextract('hmdb_file.xml', 'hmmdb_file.csv')

The output csv should look like this:

normal_concentration_spec,normal_concentration_conc
Blood; Feces; Salvia,2.8 +/- 8.8; NA; 5.2
Blood; Feces; Salvia,5; NA; 3-7

In reality I also pull out many other things that have only a single value per metabolite, which is why I prefer this CSV format. However, since some of the concentration_value slots are empty, I get different numbers of specimens and values, and won't be able to tell which value belongs to which specimen in the end.

How can I get something like an NA value for each missing concentration_value? (Ideally while keeping the general structure of the code and the lxml package, since I have to pull out a lot of other things for which this is already set up.)

Asked By: yasel


Answers:

For an empty element, the text() XPath returns a zero-length list. That can be used to show NA instead:

>>> context = etree.iterparse('tmp.xml', tag='{http://www.hmdb.ca}concentration_value')
>>> for event, elem in context:
...     tlc = elem.xpath('text()', namespaces=ns)
...     print(len(tlc), tlc)
... 
1 ['2.8 +/- 8.8']
0 []
1 ['5.2']
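The same check can be wrapped in a small helper. The sketch below is only an illustration: the inline XML and the helper name text_or_na are made up (they are not from the real HMDB file), and treating whitespace-only text as missing is an extra assumption. It falls back to the standard library parser if lxml is not installed, since it only uses calls that both share.

```python
try:
    from lxml import etree
except ImportError:  # stdlib fallback; only calls shared by both APIs are used
    import xml.etree.ElementTree as etree

# Hypothetical sample, not taken from the real HMDB file.
SAMPLE = b"""
<concentration xmlns="http://www.hmdb.ca">
  <concentration_value>2.8 +/- 8.8</concentration_value>
  <concentration_value></concentration_value>
  <concentration_value/>
  <concentration_value>  </concentration_value>
</concentration>
"""

def text_or_na(elem, placeholder='NA'):
    """Return the element's stripped text, or the placeholder if it has none."""
    text = elem.text  # None for <x></x> and <x/>
    if text is None or not text.strip():
        return placeholder
    return text.strip()

root = etree.fromstring(SAMPLE)
values = [text_or_na(child) for child in root]
print(values)
```

Only the first element carries text, so the other three come back as the placeholder.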

Using OP’s code

from lxml import etree

ns = {'hmdb': 'http://www.hmdb.ca'}
context = etree.iterparse('/home/luis/tmp/tmp.xml', tag='{http://www.hmdb.ca}metabolite')

for event, elem in context:
    try:
        tl = elem.xpath('hmdb:normal_concentrations/hmdb:concentration/hmdb:biospecimen', namespaces=ns)
        normal_concentration_spec = '; '.join(str(e.text) for e in tl)
    except Exception as ex:
        print(ex)
        normal_concentration_spec = 'NA'
    try:
        tl = elem.xpath('hmdb:normal_concentrations/hmdb:concentration/hmdb:concentration_value', namespaces=ns)
        normal_concentration_conc = '; '.join(e.text if e.text is not None else 'NA' for e in tl)
    except Exception as ex:
        normal_concentration_conc = 'NA'

    print(normal_concentration_spec, normal_concentration_conc)

Result

Blood; Feces; Salvia 2.8 +/- 8.8; NA; 5.2
Blood; Feces; Salvia 5; NA; 3-7
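
Note that this only covers concentration_value elements that are present but empty. If a concentration block could lack the child element entirely, the two joined lists would still drift out of step. A minimal sketch of one way around that, assuming each concentration holds at most one biospecimen and one concentration_value, is to loop over the concentration elements and read both children from each one. The names child_text and concentration_columns and the inline sample are hypothetical, and the standard library fallback is there only so the sketch runs without lxml.

```python
try:
    from lxml import etree
except ImportError:  # stdlib fallback; only calls shared by both APIs are used
    import xml.etree.ElementTree as etree

ns = {'hmdb': 'http://www.hmdb.ca'}

# Hypothetical sample: the second value is empty, the third is missing entirely.
SAMPLE = b"""
<metabolite xmlns="http://www.hmdb.ca">
  <normal_concentrations>
    <concentration>
      <biospecimen>Blood</biospecimen>
      <concentration_value>2.8 +/- 8.8</concentration_value>
    </concentration>
    <concentration>
      <biospecimen>Feces</biospecimen>
      <concentration_value></concentration_value>
    </concentration>
    <concentration>
      <biospecimen>Urine</biospecimen>
    </concentration>
  </normal_concentrations>
</metabolite>
"""

def child_text(parent, tag, placeholder='NA'):
    # findtext() gives the default for a missing child and '' for an empty one,
    # so both cases collapse to the placeholder here.
    text = parent.findtext(tag, default='', namespaces=ns)
    return text.strip() or placeholder

def concentration_columns(metabolite):
    """Return (specimens, values) as '; '-joined strings of equal length."""
    specs, values = [], []
    path = 'hmdb:normal_concentrations/hmdb:concentration'
    for conc in metabolite.findall(path, namespaces=ns):
        specs.append(child_text(conc, 'hmdb:biospecimen'))
        values.append(child_text(conc, 'hmdb:concentration_value'))
    return '; '.join(specs), '; '.join(values)

spec, conc = concentration_columns(etree.fromstring(SAMPLE))
print(spec, conc, sep=',')
```

The two returned strings can be passed to the existing writer.writerow(...) call in place of the two try/except blocks.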
Answered By: LMC