How do I find a specific text within an xml page and navigate to that webpage or at least return the URL?
Question:
I have the following xml table from a sitemap.xml url (image attached):
sitemap.xml file from website
I am trying to parse out from the xml the URL by searching for these two criterias if they exist in that URL:
- "Average-Weather-in-"
- "-Year-Round"
Then return the URL
so in this case I’d be expecting a URL of https://weatherspark.com/y/8004/Average-Weather-in-Austin-Texas-United-States-Year-Round
Answers:
To find a specific part of HTML or XML documents you can parse them into a DOM data structure and then traverse it programmatically.
Even better, you can use XPath to find specific parts efficiently.
See also
One solution might be using XML/HTML parser such as beautifulsoup
:
import requests
from bs4 import BeautifulSoup
# change to your URL:
url = 'https://weatherspark.com/sitemap-271.xml'
soup = BeautifulSoup(requests.get(url).content, 'xml')
for loc in soup.select('loc'):
text = loc.text
if 'Average-Weather-in-' in text and '-Year-Round' in text:
print(text)
Prints:
...
https://weatherspark.com/y/138371/Average-Weather-in-Mulanay-Philippines-Year-Round
https://weatherspark.com/y/138372/Average-Weather-in-Mangero-Philippines-Year-Round
https://weatherspark.com/y/138373/Average-Weather-in-Malibago-Philippines-Year-Round
https://weatherspark.com/y/138374/Average-Weather-in-Madulao-Philippines-Year-Round
https://weatherspark.com/y/138375/Average-Weather-in-Macalelon-Philippines-Year-Round
...
If you are using xsl to transform the XML would use contains(), which takes 2 parameters – the XPath to the target and the text that you are looking for
<xsl:value-of select="contains('url/loc','Average-Weather-in')" />
This will return the url.
I have the following xml table from a sitemap.xml url (image attached):
sitemap.xml file from website
I am trying to parse out from the xml the URL by searching for these two criterias if they exist in that URL:
- "Average-Weather-in-"
- "-Year-Round"
Then return the URL
so in this case I’d be expecting a URL of https://weatherspark.com/y/8004/Average-Weather-in-Austin-Texas-United-States-Year-Round
To find a specific part of HTML or XML documents you can parse them into a DOM data structure and then traverse it programmatically.
Even better, you can use XPath to find specific parts efficiently.
See also
One solution might be using XML/HTML parser such as beautifulsoup
:
import requests
from bs4 import BeautifulSoup
# change to your URL:
url = 'https://weatherspark.com/sitemap-271.xml'
soup = BeautifulSoup(requests.get(url).content, 'xml')
for loc in soup.select('loc'):
text = loc.text
if 'Average-Weather-in-' in text and '-Year-Round' in text:
print(text)
Prints:
...
https://weatherspark.com/y/138371/Average-Weather-in-Mulanay-Philippines-Year-Round
https://weatherspark.com/y/138372/Average-Weather-in-Mangero-Philippines-Year-Round
https://weatherspark.com/y/138373/Average-Weather-in-Malibago-Philippines-Year-Round
https://weatherspark.com/y/138374/Average-Weather-in-Madulao-Philippines-Year-Round
https://weatherspark.com/y/138375/Average-Weather-in-Macalelon-Philippines-Year-Round
...
If you are using xsl to transform the XML would use contains(), which takes 2 parameters – the XPath to the target and the text that you are looking for
<xsl:value-of select="contains('url/loc','Average-Weather-in')" />
This will return the url.