XML parser in BeautifulSoup only scrapes the first symbol out of two

Question

I wish to read symbols from some XML content stored in a text file. When I use xml as the parser, I get only the first symbol. However, I get both symbols when I use the lxml parser. Here is the XML content.

<?xml version="1.0" encoding="utf-8"?>
<lookupdata symbolstring="WDS">
    <key>
        <symbol>WDS</symbol>
        <exchange>NYE</exchange>
        <openfigi>BBG001S5WCY6</openfigi>
        <qmidentifier>USI79Z473117AAG</qmidentifier>
    </key>
    <equityinfo>
        <longname>
        Woodside Energy Group Limited American Depositary Shares each representing one
        </longname>
        <shortname>Woodside Energy </shortname>
        2
        <instrumenttype>equity</instrumenttype>
        <sectype>DR</sectype>
        <isocfi>EDSXFR</isocfi>
        <issuetype>AD</issuetype>
        <proprietaryquoteeligible>false</proprietaryquoteeligible>
    </equityinfo>
</lookupdata>
<lookupdata symbolstring="PAM">
    <key>
        <symbol>PAM</symbol>
        <exchange>NYE</exchange>
        <openfigi>BBG001T5K0S1</openfigi>
        <qmidentifier>USI68Z3Z75887AS</qmidentifier>
    </key>
    <equityinfo>
        <longname>Pampa Energia S.A.</longname>
        <shortname>PAM</shortname>
        <instrumenttype>equity</instrumenttype>
        <sectype>DR</sectype>
        <isocfi>EDSXFR</isocfi>
        <issuetype>AD</issuetype>
    </equityinfo>
</lookupdata>

When I read the xml content from a text file and parse the symbols, I get only the first symbol.

from bs4 import BeautifulSoup

with open("input_xml.txt") as infile:
    item = infile.read()

soup = BeautifulSoup(item,"xml")
for item in soup.select("lookupdata symbol"):
    print(item.text)

Current output:

WDS

If I replace "xml" with "lxml" in BeautifulSoup(item,"xml"), I get both symbols. When I use lxml, a warning pops up, though.

As the content is XML, I would like to stick to the xml parser instead of lxml.

Expected output:

WDS
PAM

Asked By: SMTH

Answer 1

The issue seems to be that the xml parser (which BeautifulSoup backs with lxml's XML parser, not a builtin library) only loads the first item: it stops after the first lookupdata element ends. A well-formed XML document has exactly one top-level (root) element, and all the examples in the XML docs have some top-level container element, so I'm assuming the parser stops once the first top-level element closes (though I'm not sure, it's just an assumption). You can add a print(soup) after you load it to see what was actually parsed.
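The single-root rule can be checked independently of BeautifulSoup. The following is a minimal sketch using Python's standard-library xml.etree.ElementTree, which is strict and raises an error where the xml parser in BeautifulSoup silently keeps the first element. The helper name is_well_formed and the shortened sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if a strict XML parser accepts the text (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Shortened stand-in for the question's data: two top-level elements, no root.
two_roots = (
    "<lookupdata><symbol>WDS</symbol></lookupdata>"
    "<lookupdata><symbol>PAM</symbol></lookupdata>"
)
# The same content wrapped in a single dummy root element.
wrapped = f"<root>{two_roots}</root>"

print(is_well_formed(two_roots))  # False: content after the first root is rejected
print(is_well_formed(wrapped))    # True: exactly one top-level element
```

A strict parser reports the problem explicitly, whereas a lenient one may just stop early, which would match the single WDS in the question's output.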

You could use BeautifulSoup(item, "html.parser"), which uses Python's builtin HTML parser and returns both symbols here.

Or, to keep using the xml parser, wrap the content in a dummy top-level element, like:

from bs4 import BeautifulSoup

with open("input_xml.txt") as infile:
    item = infile.read()

patched = f"<root>{item}</root>"

soup = BeautifulSoup(patched, "xml")
for found in soup.select("lookupdata symbol"):
    print(found.text)

Output:

WDS
PAM
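One caveat with the wrapping approach: the input file begins with an XML declaration (<?xml ... ?>), which is strictly only allowed at the very start of a document. The output above suggests the xml parser in BeautifulSoup tolerates it inside the dummy root, but a strict parser would not. Below is a hedged sketch that strips the declaration before wrapping and reads the symbols with the standard-library ElementTree instead of BeautifulSoup. The function names wrap_fragments and read_symbols, and the shortened sample data, are assumptions for illustration only.

```python
import re
import xml.etree.ElementTree as ET

def wrap_fragments(text, root_tag="root"):
    """Remove a leading XML declaration, then wrap everything in one root element."""
    body = re.sub(r"^\s*<\?xml[^>]*\?>", "", text, count=1)
    return f"<{root_tag}>{body}</{root_tag}>"

def read_symbols(text):
    """Return the text of every <symbol> element found under the dummy root."""
    root = ET.fromstring(wrap_fragments(text))
    return [el.text.strip() for el in root.iter("symbol") if el.text]

# Shortened stand-in for the contents of input_xml.txt.
sample = """<?xml version="1.0" encoding="utf-8"?>
<lookupdata symbolstring="WDS">
    <key><symbol>WDS</symbol></key>
</lookupdata>
<lookupdata symbolstring="PAM">
    <key><symbol>PAM</symbol></key>
</lookupdata>
"""

for symbol in read_symbols(sample):
    print(symbol)
```

The same wrap_fragments output could be passed to BeautifulSoup(..., "xml") instead, if keeping the select("lookupdata symbol") style is preferred.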

Answered By: Sean Breckenridge
