Filter non-nested tag values from XML

Question:

I have an XML file that looks like this:

<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
    <offer id="11" parent_id="12">
        <name>Alpha</name>
        <pos>697</pos>
        <kat_pis>
            <pos kat="2">112</pos>
        </kat_pis>
    </offer>
    <offer id="12" parent_id="31">
        <name>Beta</name>
        <pos>099</pos>
        <kat_pis>
            <pos kat="2">113</pos>
        </kat_pis>
    </offer>
</details>
</main_heading>

I am parsing it with BeautifulSoup. When I run this:

soup = BeautifulSoup(file, 'xml')

pos = []
for i in (soup.find_all('pos')):
    pos.append(i.text)

I get a list of all pos tag values, including the ones nested inside the kat_pis tag.

So I get (697, 112, 099, 113).

However, I only want the pos values of the non-nested tags.

The desired result is (697, 099).

How can I achieve this?

Asked By: x89

||

Answers:

I think the best solution would be to abandon BeautifulSoup for an XML parser with XPath support, like lxml. Using XPath expressions, you can ask for only those pos elements that are direct children of offer elements:

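from lxml import etree

# Open in binary mode so lxml can apply the encoding declared in the XML prolog.
with open('data.xml', 'rb') as fd:
    doc = etree.parse(fd)

# '//offer/pos' matches only pos elements that are direct children of an offer,
# so the pos elements nested inside kat_pis are skipped.
pos = [ele.text for ele in doc.xpath('//offer/pos')]

print(pos)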

Given your example input, the above code prints:

['697', '099']
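
If adding lxml as a dependency is not an option, the same child-only selection can probably be sketched with the standard library's xml.etree.ElementTree, whose limited XPath subset accepts a path like './/offer/pos'. This is only an illustration: the sample string is a trimmed copy of the question's data, and the name direct_pos is made up for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document from the question (assumed input).
xml_doc = """<main_heading timestamp="20220113">
<details>
    <offer id="11" parent_id="12">
        <name>Alpha</name>
        <pos>697</pos>
        <kat_pis>
            <pos kat="2">112</pos>
        </kat_pis>
    </offer>
    <offer id="12" parent_id="31">
        <name>Beta</name>
        <pos>099</pos>
        <kat_pis>
            <pos kat="2">113</pos>
        </kat_pis>
    </offer>
</details>
</main_heading>"""

root = ET.fromstring(xml_doc)

# './/offer' finds every offer below the root; '/pos' then selects only
# the pos elements that are direct children of each offer.
direct_pos = [ele.text for ele in root.findall('.//offer/pos')]

print(direct_pos)  # ['697', '099']
```

The values come back as strings, which also preserves the leading zero in '099'.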
Answered By: larsks

Here is one way of getting only the first-level pos tags while staying with BeautifulSoup, using a CSS selector with the child combinator (>):

from bs4 import BeautifulSoup as bs

xml_doc = '''<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
    <offer id="11" parent_id="12">
        <name>Alpha</name>
        <pos>697</pos>
        <kat_pis>
            <pos kat="2">112</pos>
        </kat_pis>
    </offer>
    <offer id="12" parent_id="31">
        <name>Beta</name>
        <pos>099</pos>
        <kat_pis>
            <pos kat="2">113</pos>
        </kat_pis>
    </offer>
</details>
</main_heading>'''

soup = bs(xml_doc, 'xml')

# 'offer > pos' matches pos tags whose parent is an offer,
# which excludes the pos tags nested inside kat_pis.
pos = [i.text for i in soup.select('offer > pos')]

print(pos)

Result in terminal:

['697', '099']
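
A possible alternative that avoids CSS selectors is to walk each offer and call find_all with recursive=False, which restricts the search to direct children. The sketch below assumes bs4 and lxml are installed (the 'xml' parser relies on lxml); the trimmed sample string and the name child_pos are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample document from the question (assumed input).
xml_doc = """<main_heading timestamp="20220113">
<details>
    <offer id="11" parent_id="12">
        <name>Alpha</name>
        <pos>697</pos>
        <kat_pis>
            <pos kat="2">112</pos>
        </kat_pis>
    </offer>
    <offer id="12" parent_id="31">
        <name>Beta</name>
        <pos>099</pos>
        <kat_pis>
            <pos kat="2">113</pos>
        </kat_pis>
    </offer>
</details>
</main_heading>"""

soup = BeautifulSoup(xml_doc, 'xml')

# recursive=False limits find_all to the direct children of each offer,
# so the pos tags inside kat_pis are never visited.
child_pos = [
    p.text
    for offer in soup.find_all('offer')
    for p in offer.find_all('pos', recursive=False)
]

print(child_pos)  # ['697', '099']
```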
Answered By: Barry the Platipus