Python – Print url and name of page only
Question:
I have the following code:
res = requests.get("http://www.ucdenver.edu/pages/ucdwelcomepage.aspx")
soup = BeautifulSoup(res.content, 'html5lib')
# select() takes a CSS selector, so the attribute filter goes inside the selector string
scripts = soup.select('script[type="application/ld+json"]')
for script in scripts:
    print(script)
From this code I got the result below. I want to get into the "department" array
(it contains multiple departments) to capture two elements:
{
  "@context": "https://schema.org/",
  "@type": "Organization",
  "url": "https://www.ucdenver.edu",
  "logo": "https://www.ucdenver.edu/images/default-source/global-theme-images/cu_logo.png",
  "name": "University of Colorado Denver",
  "alternateName": "CU Denver",
  "telephone": "1+ 303-315-5969",
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "1201 Larimer Street",
    "addressLocality": "Denver",
    "addressRegion": "CO",
    "postalCode": "80204",
    "addressCountry": "US"
  },
  "department": [{
    "name": "Center for Undergraduate Exploration and Advising",
    "email": "mailto:[email protected]",
    "telephone": "1+ 303-315-1940",
    "url": "https://www.ucdenver.edu/center-for-undergraduate-exploration-and-advising",
    "address": [{
      "@type": "PostalAddress",
      "streetAddress": "1201 Larimer Street #1113",
      "addressLocality": "Denver",
      "addressRegion": "CO",
      "postalCode": "80204",
      "addressCountry": "US"
    }]
  },
From each department object I only want to capture "name" and "url".
This is my first time playing with web scraping, and I’m not sure how to get into "department": [{
to then capture the two elements I want.
Answers:
Once you get back the JSON output you’ve shown as a Python dict and have stored it in a variable called data, for example, you can do:
result = []
for department in data["department"]:
    result.append({"name": department["name"], "url": department["url"]})
# prints a list of dicts, something like:
# [{'name': 'Center for Undergraduate Exploration and Advising', 'url': 'https://www.ucdenver.edu/center-for-undergraduate-exploration-and-advising'}, {'name': 'another name', 'url': 'another url'}, ...]
print(result)
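To try this loop without fetching the page, here is a minimal sketch that builds data from a trimmed copy of the JSON in the question. The sample string and its second department ("Example Department") are made up for illustration; json.loads is the standard-library call that turns JSON text into a Python dict.

```python
import json

# Trimmed sample of the JSON from the question; the second department
# is a made-up placeholder so the loop has more than one item.
sample = """
{
  "name": "University of Colorado Denver",
  "url": "https://www.ucdenver.edu",
  "department": [
    {"name": "Center for Undergraduate Exploration and Advising",
     "url": "https://www.ucdenver.edu/center-for-undergraduate-exploration-and-advising",
     "telephone": "1+ 303-315-1940"},
    {"name": "Example Department",
     "url": "https://www.example.com/dept"}
  ]
}
"""

data = json.loads(sample)  # JSON text -> Python dict

# Keep only the two keys of interest from each department
result = [{"name": d["name"], "url": d["url"]} for d in data["department"]]

for entry in result:
    print(entry["name"], entry["url"])
```

Note that data["url"] is the organization's own URL; the department URLs only appear inside the "department" list, which is why the loop indexes into each item.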
This worked for me:
from bs4 import BeautifulSoup
import requests
import json

res = requests.get("http://www.ucdenver.edu/pages/ucdwelcomepage.aspx")
soup = BeautifulSoup(res.content, 'html5lib')
scripts = soup.find_all(attrs={"type":"application/ld+json"})
for s in scripts:
    content = s.contents[0] # get the text of the script node
    j = json.loads(content) # parse it as JSON into a Python data structure
    for dept in j["department"]:
        print(">>>", dept["name"], dept["url"])
You first extract the text of the script node. Then convert that text using the json
package to a Python data structure. Then you can iterate through the data using a for-loop.
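One caveat: a page can contain several ld+json scripts, and any of them may lack a "department" key, in which case j["department"] raises KeyError. A possible more defensive variant is sketched below. The helper name extract_departments is hypothetical (it is not part of BeautifulSoup or json), and it takes the script text as a plain string so it can be tried without a network request.

```python
import json

def extract_departments(script_text):
    """Return (name, url) pairs from one ld+json script's text.

    Hypothetical helper for illustration; returns an empty list when the
    text is not valid JSON or has no usable "department" entries.
    """
    try:
        parsed = json.loads(script_text)
    except json.JSONDecodeError:
        return []
    # ld+json may hold a single object or a list of objects
    items = parsed if isinstance(parsed, list) else [parsed]
    pairs = []
    for item in items:
        if not isinstance(item, dict):
            continue
        departments = item.get("department", [])
        if isinstance(departments, dict):  # a single department given as an object
            departments = [departments]
        if not isinstance(departments, list):
            continue
        for dept in departments:
            # skip entries that are missing either key
            if isinstance(dept, dict) and "name" in dept and "url" in dept:
                pairs.append((dept["name"], dept["url"]))
    return pairs

# Made-up inputs for illustration
with_depts = '{"@type": "Organization", "department": [{"name": "Advising", "url": "https://www.example.com/advising"}]}'
without_depts = '{"@type": "WebSite", "name": "Example"}'

print(extract_departments(with_depts))
print(extract_departments(without_depts))
print(extract_departments("not json"))
```

In the scraping loop this could be called as extract_departments(s.string or ""), so that an empty script tag is skipped rather than raising an error.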