Scraping a page on the internet for numbers
Question:
I have the following code. It opens a Mass Lottery page and tries to get the winning numbers, but it doesn't work. The path looks good, though. Please help.
from bs4 import BeautifulSoup
import requests
url = 'https://www.masslottery.com/games/draw-and-instants/mass-cash?date=2023-02-22'
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse the html content
soup = BeautifulSoup(html_content, "lxml")
# print(soup.prettify()) # print the parsed data of html
# Scrape the numbers
numbers = soup.find_all('span', attrs={'class': "winning-number-ball-circle"})
# Convert the numbers to int type
numbers = [int(number.text) for number in numbers]
# Print the numbers
print(numbers)
Answers:
Upon inspecting the webpage, we find that the HTML source (stored in html_content) does not contain the game results (print html_content to check). This is because the page fetches the results from an API after it loads, and requests only retrieves the initial HTML. The API is accessible at:
https://www.masslottery.com/api/v1/draw-results
Instead, let's GET the results from there. For historical results, make a GET request with the draw_date parameter set to the desired date in ISO format (YYYY-MM-DD), for example:
https://www.masslottery.com/api/v1/draw-results/mass_cash?draw_date=2023-02-21
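To avoid typing the date by hand, the URL can be built from a date object. This is only a sketch: draw_results_url is a hypothetical helper name, and it assumes draw_date is the only parameter needed, as in the example URL above.

```python
from datetime import date
from urllib.parse import urlencode

# Endpoint taken from the example URL above
BASE_URL = "https://www.masslottery.com/api/v1/draw-results/mass_cash"


def draw_results_url(draw_date: date) -> str:
    """Return the API URL for one draw date (hypothetical helper)."""
    # date.isoformat() produces YYYY-MM-DD, the ISO format used above
    query = urlencode({"draw_date": draw_date.isoformat()})
    return f"{BASE_URL}?{query}"


print(draw_results_url(date(2023, 2, 21)))
```

For the date shown, this prints the same URL as the example above.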
Bonus: the browser's DevTools (Inspect > Network in Chrome, or the equivalent in other browsers) show the site's network activity, which provides useful information about the API requests and responses.
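A quick way to confirm that the fetched markup really lacks the result elements is to count how many tags carry the class from the question. The sketch below uses only the standard library so it runs without extra packages; ClassCounter and count_class are hypothetical names, and both HTML strings are made-up stand-ins, not copies of the real page.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name: str) -> None:
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html: str, class_name: str) -> int:
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


# Made-up examples: markup as served vs. markup after scripts have run
static_html = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
rendered_html = '<div id="root"><span class="winning-number-ball-circle">7</span></div>'

print(count_class(static_html, "winning-number-ball-circle"))    # 0
print(count_class(rendered_html, "winning-number-ball-circle"))  # 1
```

With the real page, pass requests.get(url).text as the first argument; a count of 0 would be consistent with the results being loaded separately from the API.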
Sample code to handle the API response as requested:
import requests
# parameters
url = "https://www.masslottery.com/api/v1/draw-results/mass_cash"
params = {"draw_date": "2023-02-21"}
# make the GET request and fail early on an HTTP error status
response = requests.get(url, params=params, timeout=10)
response.raise_for_status()
# parse the JSON response
data = response.json()
# access the first item in the "winningNumbers" list
# and retrieve its "winningNumbers" key
winning_numbers = data["winningNumbers"][0]["winningNumbers"]
print(winning_numbers)
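If a date has no draw, indexing [0] directly would raise an error. A more defensive version might look like the sketch below. The payload shape is assumed from the indexing used above and has not been checked against the live API; extract_winning_numbers is a hypothetical helper, and the sample numbers are invented.

```python
def extract_winning_numbers(data: dict) -> list:
    """Return the first draw's numbers, or an empty list if there is none."""
    # Assumed shape: {"winningNumbers": [{"winningNumbers": [...]}, ...]}
    draws = data.get("winningNumbers") or []
    if not draws:
        return []
    return list(draws[0].get("winningNumbers", []))


# Invented payloads for illustration only
sample = {"winningNumbers": [{"winningNumbers": [1, 2, 3, 4, 5]}]}
empty = {"winningNumbers": []}

print(extract_winning_numbers(sample))  # [1, 2, 3, 4, 5]
print(extract_winning_numbers(empty))   # []
```

In the sample code above, data from response.json() would be passed in place of sample.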