Web Scraper: I need help pulling out the text in between the attribute… Any help would be appreciated
Question:
My Goal
I need to collect all of the video game titles, genre, description, type, and release year on every page.
total_games = 26,215
The "start=9951" parameter in the URL changes to "after=WzUuNSwidHQ4NjcxMDM2IiwxMDAwMV0%3D" on the next page.
I was originally going to loop with pages = np.arange(1, total_games, 50), i.e. step from 1 to 26215 in increments of 50 entries, but then I ran into this problem.
HTML: <a href="/search/title/?title_type=video_game&sort=user_rating,desc&after=WzUuNSwidHQxODAxMDU0IiwxMDA1MV0%3D&ref_=adv_nxt" class="lister-page-next next-page">Next »</a>
How do I extract a portion of the href and add it to the base URL so that I can loop over the pages?
Outcome:
"https://www.imdb.com/search/title/?title_type=video_game&sort=user_rating,desc&" + "after=WzUuNSwidHQ4NjcxMDM2IiwxMDAwMV0%3D" + "&ref_=adv_nxt"
The middle piece (the "after=..." value) is the part of the href that changes on every page; it is what I want to grab on each page to iterate to the next one.
Answers:
You can save yourself the headache and simply check whether the "Next" button exists in the HTML. If it does, extract its href and follow the link; otherwise you’ve reached the last page.
Assuming you’re using BeautifulSoup and you’ve prepared your soup:
next_link_tag = soup.find('a', {'class': 'next-page'})  # Find the <a> tag with the class "next-page"
if next_link_tag:  # A "Next" link exists
    next_link = next_link_tag.get('href')  # Relative href starting with '/'; prepend 'https://www.imdb.com' before requesting it
else:
    pass  # There's no next page. Act accordingly
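Once you have the href, turning it into a full URL (or pulling out only the "after" value, as the question asks) can be done with the standard library. The following is a minimal sketch, not a definitive implementation: the helper names absolute_next_url and after_token are made up for illustration, and the href is the one quoted in the question.

```python
from urllib.parse import parse_qs, urljoin, urlsplit

BASE_URL = "https://www.imdb.com"  # site root taken from the question's URL


def absolute_next_url(href, base=BASE_URL):
    """Join the relative href of the "Next" link onto the site root (hypothetical helper)."""
    return urljoin(base, href)


def after_token(href):
    """Return the decoded value of the "after" query parameter, or None if it is absent (hypothetical helper)."""
    values = parse_qs(urlsplit(href).query).get("after")
    return values[0] if values else None


# The href copied from the question's HTML snippet.
href = (
    "/search/title/?title_type=video_game&sort=user_rating,desc"
    "&after=WzUuNSwidHQxODAxMDU0IiwxMDA1MV0%3D&ref_=adv_nxt"
)

next_url = absolute_next_url(href)  # full URL to request for the next page
token = after_token(href)           # only the changing part; %3D is decoded to "="
```

In a scraping loop you would typically request next_url directly and repeat until no "next-page" tag is found. Extracting the token separately is only needed if you want to rebuild the URL yourself.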