How to scrape a specific name from a link with bs4
Question:
I’m trying to use bs4 to scrape this webpage to get the title and rating of each "Episode". I already have the ratings working, using the following code:
import requests
from bs4 import BeautifulSoup

first_url = 'https://www.imdb.com/search/title/?series=tt0206512&view=simple&sort=release_date,asc'
page = requests.get(first_url)
soup = BeautifulSoup(page.content, 'html.parser')
# collect the rating cell for each episode
ratings = soup.find_all("div", {"class": "col-imdb-rating"})
However, when I try to select on the ‘a’ tag, it’s not quite working. Does anyone have suggestions on how to get each episode name from this website?
For example, the text I’m looking for here is: "Help Wanted/Reef Blower/Tea at the Treedome"
Answers:
When a URL is given as /some/folder/somepage, it is relative to the site root (https://www.imdb.com in this case). So get the href value from the <a> tag and append it to the root to get https://www.imdb.com/title/tt0707293/?ref_=adv_li_tt.
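The joining step described above can be sketched with the standard library's urllib.parse.urljoin, which is swapped in here for plain string concatenation because it handles leading slashes consistently. The helper name absolute_url and the constant BASE_URL are hypothetical, introduced only for illustration. The href value is the one quoted in this answer.

```python
from urllib.parse import urljoin

# Site root taken from the question's URL.
BASE_URL = "https://www.imdb.com"


def absolute_url(href: str, base: str = BASE_URL) -> str:
    """Resolve a root-relative href (as found in an <a> tag) against the site root."""
    return urljoin(base, href)


# href value as it would appear in the episode's <a> tag
print(absolute_url("/title/tt0707293/?ref_=adv_li_tt"))
```

In a scraping loop, the same helper could be called on link["href"] for each <a> element returned by BeautifulSoup.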
There are many a elements on the page, so the episode links can be selected by their position relative to a nearby element, in this case small. In a CSS selector, the + sign is the adjacent sibling combinator: small + a matches an a element that immediately follows a small element. Try this:
episodes = soup.select("div.lister-item small + a[href]")
for episode in episodes:
    print(episode.text)
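To see what the selector does without hitting the live site, here is a minimal sketch run against a hand-written HTML fragment. The fragment is an assumption about the page layout, built only from the class names used in this thread (lister-item, col-imdb-rating). The rating value, the second episode, and the helper name episode_rows are placeholders invented for illustration, and the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the structure and the rating value are placeholders.
HTML = """
<div class="lister-item">
  <a href="/title/tt0206512/">Series title</a>
  <small>Episode:</small>
  <a href="/title/tt0707293/?ref_=adv_li_tt">Help Wanted/Reef Blower/Tea at the Treedome</a>
  <div class="col-imdb-rating"><strong>8.0</strong></div>
</div>
<div class="lister-item">
  <a href="/title/tt0206512/">Series title</a>
  <small>Episode:</small>
  <a href="/title/placeholder/">Placeholder Episode</a>
</div>
"""


def episode_rows(soup):
    """Return (episode title, rating text or None) for each lister item."""
    rows = []
    for item in soup.select("div.lister-item"):
        # 'small + a' matches only the <a> directly after <small>,
        # so the series link that precedes <small> is skipped.
        link = item.select_one("small + a[href]")
        if link is None:
            continue
        rating = item.select_one("div.col-imdb-rating")
        rows.append((
            link.get_text(strip=True),
            rating.get_text(strip=True) if rating else None,
        ))
    return rows


soup = BeautifulSoup(HTML, "html.parser")
for title, rating in episode_rows(soup):
    print(title, rating)
```

Iterating per lister-item, rather than selecting titles and ratings in two separate lists, keeps each title paired with its own rating even when an episode has no rating yet.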