XML parser in BeautifulSoup only scrapes the first symbol out of two
Question:
I wish to read symbols from some XML content stored in a text file. When I use xml as the parser, I get the first symbol only. However, I get both symbols when I use the lxml parser. Here is the XML content.
<?xml version="1.0" encoding="utf-8"?>
<lookupdata symbolstring="WDS">
<key>
<symbol>WDS</symbol>
<exchange>NYE</exchange>
<openfigi>BBG001S5WCY6</openfigi>
<qmidentifier>USI79Z473117AAG</qmidentifier>
</key>
<equityinfo>
<longname>
Woodside Energy Group Limited American Depositary Shares each representing one
</longname>
<shortname>Woodside Energy </shortname>
2
<instrumenttype>equity</instrumenttype>
<sectype>DR</sectype>
<isocfi>EDSXFR</isocfi>
<issuetype>AD</issuetype>
<proprietaryquoteeligible>false</proprietaryquoteeligible>
</equityinfo>
</lookupdata>
<lookupdata symbolstring="PAM">
<key>
<symbol>PAM</symbol>
<exchange>NYE</exchange>
<openfigi>BBG001T5K0S1</openfigi>
<qmidentifier>USI68Z3Z75887AS</qmidentifier>
</key>
<equityinfo>
<longname>Pampa Energia S.A.</longname>
<shortname>PAM</shortname>
<instrumenttype>equity</instrumenttype>
<sectype>DR</sectype>
<isocfi>EDSXFR</isocfi>
<issuetype>AD</issuetype>
</equityinfo>
</lookupdata>
When I read the xml content from a text file and parse the symbols, I get only the first symbol.
from bs4 import BeautifulSoup

with open("input_xml.txt") as infile:
    item = infile.read()

soup = BeautifulSoup(item, "xml")
for item in soup.select("lookupdata symbol"):
    print(item.text)
current output:
WDS
If I replace xml with lxml in BeautifulSoup(item,"xml"), I get both symbols. When I use lxml, a warning pops up, though.
As the content is XML, I would like to stick to the xml parser instead of lxml.
Expected output:
WDS
PAM
Answers:
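The issue seems to be that the xml parser only loads the first item: it stops after the first lookupdata element ends. A well-formed XML document must have exactly one root element, and this file has two top-level lookupdata elements, so the parser appears to treat everything after the first one as outside the document (I have not confirmed this in the parser source, so treat it as an assumption). Note that BeautifulSoup's "xml" option is backed by lxml's XML parser rather than Python's built-in xml library. You can add a print(soup) after loading to see what was actually parsed.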
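To see the single-root rule in isolation, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of BeautifulSoup. It is a strict parser, so it raises an error rather than silently stopping. The shortened sample string and the helper name is_well_formed are assumptions made for illustration, standing in for the content of the file.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the file: two top-level elements, no common root.
sample = (
    '<lookupdata symbolstring="WDS"><symbol>WDS</symbol></lookupdata>'
    '<lookupdata symbolstring="PAM"><symbol>PAM</symbol></lookupdata>'
)

def is_well_formed(text):
    """Return True if text parses as a single well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed(sample))                    # False: a second root element follows the first
print(is_well_formed(f"<root>{sample}</root>"))  # True: one root wraps both elements
```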
You could use BeautifulSoup(item, "html.parser"), which uses the built-in html.parser module and does return both symbols.
Or, to keep using the xml parser, surround the content with a top-level dummy element, like:
from bs4 import BeautifulSoup

with open("input_xml.txt") as infile:
    item = infile.read()

# Wrap the content in a dummy root so there is a single top-level element.
patched = f"<root>{item}</root>"
soup = BeautifulSoup(patched, "xml")
for found in soup.select("lookupdata symbol"):
    print(found.text)
Output:
WDS
PAM
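One caveat with the wrapping approach: the file starts with an XML declaration, and after wrapping it ends up inside the dummy element, where a strict parser will generally reject it. The sketch below shows one way to handle that, assuming it is acceptable to drop the declaration before wrapping. It uses only the standard library so it can run without bs4 installed. The helper names wrap_fragments and extract_symbols, the root wrapper tag, and the shortened sample are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

def wrap_fragments(text, wrapper="root"):
    """Wrap several top-level elements in one dummy root element."""
    # An XML declaration is only allowed at the very start of a document,
    # so remove it before placing the content inside the wrapper.
    body = re.sub(r"^\s*<\?xml[^>]*\?>", "", text, count=1)
    return f"<{wrapper}>{body}</{wrapper}>"

def extract_symbols(text):
    """Return the text of every <symbol> element in the wrapped content."""
    root = ET.fromstring(wrap_fragments(text))
    return [el.text for el in root.iter("symbol")]

# Shortened stand-in for input_xml.txt.
sample = """<?xml version="1.0" encoding="utf-8"?>
<lookupdata symbolstring="WDS"><key><symbol>WDS</symbol></key></lookupdata>
<lookupdata symbolstring="PAM"><key><symbol>PAM</symbol></key></lookupdata>"""

print(extract_symbols(sample))  # ['WDS', 'PAM']
```

The same wrap_fragments output could be passed to BeautifulSoup(..., "xml") in place of the f-string above if you prefer to keep the select() call.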