Web Scraper: I need help pulling out the text in between the attribute… Any help would be appreciated

Question

Link =
https://www.imdb.com/search/title/?title_type=video_game&sort=user_rating,desc&after=1&ref_=adv_nxt

My Goal

I need to collect all of the video game titles, genre, description, type, and release year on every page.

My Problem
https://www.imdb.com/search/title/?title_type=video_game&sort=user_rating,desc&start=9951&ref_=adv_nxt

total_games = 26,215

The "start=9951" changes to "after=WzUuNSwidHQ4NjcxMDM2IiwxMDAwMV0%3D" on the next page iteration

I was originally going to loop over pages = np.arange(1, total_games, 50), stepping through entries 1 to 26,215 in increments of 50, but then I ran into this problem: past a certain page the URL no longer carries a numeric offset.

HTML: <a href="/search/title/?title_type=video_game&sort=user_rating,desc&after=WzUuNSwidHQxODAxMDU0IiwxMDA1MV0%3D&ref_=adv_nxt" class="lister-page-next next-page">Next »</a>

How do I extract that portion of the href and add it to the base link so that I can loop over the pages?

Outcome:

"https://www.imdb.com/search/title/?title_type=video_game&sort=user_rating,desc&" + "after=WzUuNSwidHQ4NjcxMDM2IiwxMDAwMV0%3D" + "&ref_=adv_nxt"

The middle piece ("after=...") is the part inside the href that changes, and it is the part I want to grab on each page to move on to the next page.
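To make the desired outcome concrete, here is a minimal sketch using only the standard library's urllib.parse. The helper names after_token and build_page_url are made up for illustration, and the fixed query parameters are simply copied from the links above, so treat them as assumptions about how the site builds its URLs.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed base of the search URL, copied from the links in the question.
BASE = "https://www.imdb.com/search/title/"


def after_token(href):
    """Return the decoded value of the 'after' query parameter, or None."""
    values = parse_qs(urlsplit(href).query).get("after")
    return values[0] if values else None


def build_page_url(token):
    """Rebuild the full search URL around a given 'after' token."""
    params = {
        "title_type": "video_game",
        "sort": "user_rating,desc",
        "after": token,
        "ref_": "adv_nxt",
    }
    # safe="," keeps the comma in "user_rating,desc" unescaped;
    # the "=" inside the token is re-encoded as %3D.
    return BASE + "?" + urlencode(params, safe=",")
```

Note that parse_qs decodes the token (the trailing %3D becomes "="), and urlencode encodes it again when the URL is rebuilt.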

Asked By: Basic Goat Trades


Answer 1

You can save yourself the headache and simply check whether the "Next" button exists in the HTML. If it does, extract its href and follow the link; otherwise, you've reached the last page.

Assuming you’re using BeautifulSoup and you’ve prepared your soup:

next_link_tag = soup.find('a', {'class': 'next-page'})  # Find the <a> tag with the class "next-page"
if next_link_tag:  # A "Next" link exists
    # The href is relative and already starts with '/', so prepend 'https://www.imdb.com' (no trailing slash)
    next_link = 'https://www.imdb.com' + next_link_tag.get('href')
else:
    next_link = None  # There's no next page. Act accordingly (e.g. stop looping)
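As a rough sketch of how that check might drive the whole crawl: the functions next_page_url and crawl below are hypothetical names, and fetch is any callable you supply that returns a page's HTML for a URL. It assumes the "Next" link still carries the next-page class shown in the question.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.imdb.com"


def next_page_url(html, base_url=BASE_URL):
    """Return the absolute URL of the next results page, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("a", class_="next-page")
    if tag is None or not tag.get("href"):
        return None
    # urljoin resolves the relative href against the site root.
    return urljoin(base_url, tag["href"])


def crawl(start_url, fetch, max_pages=None):
    """Yield (url, html) for each page, following "Next" links until none is left."""
    url, seen = start_url, set()
    while url and url not in seen:  # the 'seen' set guards against link loops
        if max_pages is not None and len(seen) >= max_pages:
            break
        seen.add(url)
        html = fetch(url)
        yield url, html
        url = next_page_url(html)


# Possible usage (assumes the requests package; the site may also require headers):
# for url, html in crawl(start_url, lambda u: requests.get(u).text):
#     ...  # parse titles, genres, etc. from html
```

Because the loop only follows whatever link the page gives it, it does not matter whether the site uses start= or after= in the URL.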

Answered By: FluidLight
