How do I find a specific text within an xml page and navigate to that webpage or at least return the URL?

Question:

I have the following xml table from a sitemap.xml url (image attached):
sitemap.xml file from website

I am trying to parse out from the xml the URL by searching for these two criterias if they exist in that URL:

  1. "Average-Weather-in-"
  2. "-Year-Round"

Then return the URL

so in this case I’d be expecting a URL of https://weatherspark.com/y/8004/Average-Weather-in-Austin-Texas-United-States-Year-Round

Asked By: WQureshi

||

Answers:

To find a specific part of HTML or XML documents you can parse them into a DOM data structure and then traverse it programmatically.
Even better, you can use XPath to find specific parts efficiently.

See also

Answered By: Queeg

One solution might be using XML/HTML parser such as beautifulsoup:

import requests
from bs4 import BeautifulSoup

# change to your URL:
url = 'https://weatherspark.com/sitemap-271.xml'

soup = BeautifulSoup(requests.get(url).content, 'xml')

for loc in soup.select('loc'):
    text = loc.text
    if 'Average-Weather-in-' in text and '-Year-Round' in text:
        print(text)

Prints:

...

https://weatherspark.com/y/138371/Average-Weather-in-Mulanay-Philippines-Year-Round
https://weatherspark.com/y/138372/Average-Weather-in-Mangero-Philippines-Year-Round
https://weatherspark.com/y/138373/Average-Weather-in-Malibago-Philippines-Year-Round
https://weatherspark.com/y/138374/Average-Weather-in-Madulao-Philippines-Year-Round
https://weatherspark.com/y/138375/Average-Weather-in-Macalelon-Philippines-Year-Round

...
Answered By: Andrej Kesely

If you are using xsl to transform the XML would use contains(), which takes 2 parameters – the XPath to the target and the text that you are looking for

<xsl:value-of select="contains('url/loc','Average-Weather-in')" />

This will return the url.

Answered By: Bryn Lewis
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.