Python – Print url and name of page only

Question:

I have the following code:

import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.ucdenver.edu/pages/ucdwelcomepage.aspx")
soup = BeautifulSoup(res.content, 'html5lib')
# CSS attribute selector: match only the JSON-LD script tags
scripts = soup.select('script[type="application/ld+json"]')

for script in scripts:
    print(script)

From this code I got the result below. I want to get into the "department" array, which holds multiple departments, to capture two elements:


{
        "@context": "https://schema.org/",
        "@type": "Organization",
        "url": "https://www.ucdenver.edu",
        "logo": "https://www.ucdenver.edu/images/default-source/global-theme-images/cu_logo.png",
        "name": "University of Colorado Denver",
        "alternateName": "CU Denver",
         "telephone": "1+ 303-315-5969",
        "address": {
                "@type": "PostalAddress",
                "streetAddress": "1201 Larimer Street",
                "addressLocality": "Denver",
                "addressRegion": "CO",
                "postalCode": "80204",
                "addressCountry": "US"
        },
        "department": [{

                        "name": "Center for Undergraduate Exploration and Advising",
                        "email": "mailto:[email protected]",
                         "telephone": "1+ 303-315-1940",
                        "url": "https://www.ucdenver.edu/center-for-undergraduate-exploration-and-advising",
                        "address": [{
                                "@type": "PostalAddress",
                                "streetAddress": "1201 Larimer Street #1113",
                                "addressLocality": "Denver",
                                "addressRegion": "CO",
                                "postalCode": "80204",
                                "addressCountry": "US"
                        }]
                },

From each department object I only want to capture "name" and "url".

This is my first time playing with web scraping, and I’m not sure how to get into "department": [{ to then capture the two elements I want.

Asked By: David


Answers:

Once you have parsed the JSON output you’ve shown into a Python dict (stored, for example, in a variable called data), you can do:

result = []
for department in data["department"]:
    result.append({"name": department["name"], "url": department["url"]})
print(result)
# prints [{'name': 'Center for Undergraduate Exploration and Advising', 'url': 'https://www.ucdenver.edu/center-for-undergraduate-exploration-and-advising'}, {'name': 'another name', 'url': 'another url'}, ...]
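If it helps, here is a minimal sketch of where data could come from, assuming the text of the script tag is already available as a string. The variable raw is a trimmed copy of the JSON from the question, and the second department is a made-up placeholder (not from the real page) so the loop visits more than one item.

```python
import json

# Trimmed copy of the JSON-LD text from the question. The second department
# is a hypothetical placeholder, added only so there is more than one entry.
raw = """
{
    "@type": "Organization",
    "name": "University of Colorado Denver",
    "department": [
        {
            "name": "Center for Undergraduate Exploration and Advising",
            "url": "https://www.ucdenver.edu/center-for-undergraduate-exploration-and-advising"
        },
        {
            "name": "Example Department",
            "url": "https://www.example.com/department"
        }
    ]
}
"""

data = json.loads(raw)  # JSON text -> Python dict

# Keep only the two wanted keys from every department object.
result = [{"name": d["name"], "url": d["url"]} for d in data["department"]]

for entry in result:
    print(entry["name"], entry["url"])
```

The list comprehension does the same work as the append loop; either form is fine.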
Answered By: Safwan Samsudeen

This worked for me:

from bs4 import BeautifulSoup
import requests
import json

res = requests.get("http://www.ucdenver.edu/pages/ucdwelcomepage.aspx")
soup = BeautifulSoup(res.content, 'html5lib')
scripts = soup.find_all(attrs={"type":"application/ld+json"})

for s in scripts:
    content = s.contents[0]      # get the text of the script node
    j = json.loads(content)      # parse it as JSON into a Python data structure
    for dept in j["department"]:
        print(">>>", dept["name"], dept["url"])

First extract the text of the script node, then convert that text to a Python data structure with the json package. After that you can iterate through the data with a for loop.
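One caveat: j["department"] raises a KeyError for any JSON-LD script that has no "department" key, and a page can carry more than one such script. The sketch below is one possible way to guard against that, not part of the code above. The helper iter_departments and the sample strings in script_texts are hypothetical stand-ins for the script contents; it also assumes the parsed JSON may be a single object or a list, and that "department" may hold one object instead of a list.

```python
import json


def iter_departments(doc):
    """Yield (name, url) pairs from one parsed JSON-LD document."""
    # A JSON-LD script may hold a single object or a list of objects.
    items = doc if isinstance(doc, list) else [doc]
    for item in items:
        if not isinstance(item, dict):
            continue
        departments = item.get("department", [])  # missing key -> nothing to yield
        if isinstance(departments, dict):          # a single department object
            departments = [departments]
        for dept in departments:
            if isinstance(dept, dict):
                yield dept.get("name"), dept.get("url")


# Hypothetical script contents standing in for what a page might return.
script_texts = [
    '{"@type": "WebSite", "name": "No departments here"}',
    '{"@type": "Organization", "department": [{"name": "Dept A", "url": "https://www.example.com/a"}]}',
    '{"@type": "Organization", "department": {"name": "Dept B", "url": "https://www.example.com/b"}}',
]

pairs = []
for text in script_texts:
    pairs.extend(iter_departments(json.loads(text)))

for name, url in pairs:
    print(">>>", name, url)
```

Using dict.get means a department without a "name" or "url" yields None for that field instead of stopping the loop.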

Answered By: ErikR