How to web-scrape an old-school website that uses frames
Question:
I am trying to web-scrape a government site that uses a frameset.
Here is the URL – https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm
I’ve tried using Splinter/Selenium:
from splinter import Browser
import time

browser = Browser('chrome')
url = "https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm"
browser.visit(url)
time.sleep(10)
full_xpath_frame = '/html/frameset/frameset/frame[2]'
tree = browser.find_by_xpath(full_xpath_frame)
for i in tree:
    print(i.text)
It just returns an empty string.
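That empty string is expected: a frame element carries no text of its own – it only points at a separate document through its src attribute, so reading .text on the frame node yields nothing. A small illustration with BeautifulSoup, parsing a trimmed copy of the frameset that index.htm serves:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the frameset markup that index.htm serves.
frameset = ("<frameset rows='20%,*'><frame src='titlebar.htm'>"
            "<frameset cols='20%,*'><frame src='menu.htm'>"
            "<frame src='Lake_ElecSumm_all.htm' name='reports'>"
            "</frameset></frameset>")

# The third <frame> is the 'reports' frame the XPath above targets.
frame = BeautifulSoup(frameset, 'html.parser').select('frame')[2]
print(repr(frame.get_text()))  # the frame node itself holds no text
print(frame.get('src'))        # the separate document you actually need to fetch
```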
I’ve tried using the requests library.
import requests
from lxml import html
url = "https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm"
# get response object
response = requests.get(url)
# get byte string
data = response.content
print(data)
And it returns this
b"<html>\r\n<head>\r\n<meta http-equiv='Content-Type'\r\ncontent='text/html; charset=iso-
8859-1'>\r\n<title>Lake_ County Election Results</title>\r\n</head>\r\n<FRAMESET rows='20%,
*'>\r\n<FRAME src='titlebar.htm' scrolling='no'>\r\n<FRAMESET cols='20%, *'>\r\n<FRAME
src='menu.htm'>\r\n<FRAME src='Lake_ElecSumm_all.htm' name='reports'>\r\n</FRAMESET>
\r\n</FRAMESET>\r\n<body>\r\n</body>\r\n</html>\r\n"
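Each FRAME in that output is a separate document; its relative src can be resolved against the page URL with urllib.parse.urljoin. A minimal sketch using the frameset markup printed above:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Trimmed copy of the frameset that index.htm returned above.
frameset = ("<frameset rows='20%,*'><frame src='titlebar.htm'>"
            "<frameset cols='20%,*'><frame src='menu.htm'>"
            "<frame src='Lake_ElecSumm_all.htm' name='reports'>"
            "</frameset></frameset>")
page_url = ("https://lakecounty.in.gov/departments/voters/"
            "election-results-c/2022GeneralElectionResults/index.htm")

# urljoin replaces index.htm with each relative src, giving fetchable URLs.
frame_urls = [urljoin(page_url, f.get('src'))
              for f in BeautifulSoup(frameset, 'html.parser').select('frame')]
for u in frame_urls:
    print(u)
```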
I’ve also tried using Beautiful Soup, and it gave me the same thing. Is there another Python library I can use to get the data that’s inside the second frame’s table?
Thank you for any feedback.
Answers:
As mentioned, you could go for the frames and their src attributes:
BeautifulSoup(r.text, 'html.parser').select('frame')[1].get('src')
or go directly to menu.htm:
import requests
from bs4 import BeautifulSoup

base = 'https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/'
r = requests.get(base + 'menu.htm')
soup = BeautifulSoup(r.text, 'html.parser')
link_list = [base + a.get('href') for a in soup.select('a')]

for link in link_list[:1]:
    r = requests.get(link)
    soup = BeautifulSoup(r.text, 'html.parser')
    # ...scrape what is needed
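Once a frame page is fetched, its tables can be handed to pandas.read_html, which returns one DataFrame per table element. A sketch on a stand-in table (the real column names on the results pages will differ; pass r.text the same way):

```python
import io
import pandas as pd

# Stand-in for the HTML a results frame returns.
table_html = """
<table>
  <tr><th>Candidate</th><th>Votes</th></tr>
  <tr><td>A. Smith</td><td>1200</td></tr>
  <tr><td>B. Jones</td><td>980</td></tr>
</table>
"""

# read_html parses every <table> it finds into a DataFrame.
tables = pd.read_html(io.StringIO(table_html))
print(tables[0])
```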