Scraping Book rank from Amazon Book page — optimizing code

Question:

I developed a function to get the book rank from an Amazon book page, but I’m not entirely satisfied with it. It would be great to know whether it can be optimised to collect the rank more efficiently (I was thinking of something like an "if string contains" check, though I have not managed to make that work). Please find the code below:

from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

def scrap_rank_amz(link):
    # Step 1 — Get URL and content
    url = link
    request = Request(url, headers={"User-agent": "Mozilla/5.0"})
    html = urlopen(request)

    # Step 2 — Create BeautifulSoup Instance to get elements from HTML
    soup = BeautifulSoup(html, "html.parser")

    # Step 3 — Collect the info from the block where we will find the rank
    soup_rank = soup.find_all(
        class_="a-unordered-list a-nostyle a-vertical a-spacing-none detail-bullet-list",
        limit=2,
    )

    # Step 4 — Obtain specifically the items where we'll find the rank
    soup_rank_detail = [i.find(class_="a-list-item").get_text(" ") for i in soup_rank]

    # Step 5 — Obtain the rank
    soup_rank_detail_lv2 = soup_rank_detail[1][24:32]

    # Step 6 — Return rank value
    return soup_rank_detail_lv2

You can find an example of a link to be used as follows: https://www.amazon.com/Moonshine-Magic-Southern-Charms-Mystery-ebook/dp/B078SZLXB3

Thanks a lot for your time!

Sara

Asked By: Sara U.

Answers:

Assuming every page you want to scrape shows a Kindle ranking, a simple regular expression is enough.

import re

from typing import Optional
from urllib.request import urlopen, Request

def get_kindle_rank(link: str) -> Optional[str]:
    request = Request(link, headers={"User-agent": "Mozilla/5.0"})
    html = urlopen(request).read().decode()
    
    # \d matches a digit; the rank appears as e.g. "#43,847 in Kindle Store"
    regexp_match = re.search(r"#([\d,]+) in Kindle", html)
    if regexp_match:
        return regexp_match.group(1)
    else:
        return None

get_kindle_rank("https://www.amazon.com/Moonshine-Magic-Southern-Charms-Mystery-ebook/dp/B078SZLXB3")
# '43,847'

However, it won’t be much faster, as most of the runtime is spent on the request itself rather than on parsing the text.
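
One way to check that claim is to separate the parsing from the download and time the parsing on its own. The following is a minimal sketch under stated assumptions: `parse_kindle_rank` is a hypothetical helper name, and `sample_html` is a made-up fragment padded to roughly the size of a large page, not real Amazon markup. It also converts the rank to an integer, which the code above does not do.

```python
import re
import time
from typing import Optional

# Same idea as the pattern above: digits and commas between "#" and " in Kindle".
RANK_PATTERN = re.compile(r"#([\d,]+) in Kindle")


def parse_kindle_rank(html: str) -> Optional[int]:
    """Return the Kindle rank as an int, or None if the pattern is absent."""
    match = RANK_PATTERN.search(html)
    if match is None:
        return None
    # Drop the thousands separators before converting: "43,847" -> 43847
    return int(match.group(1).replace(",", ""))


# Hypothetical stand-in for a downloaded page (assumption, not real markup).
padding = "<div>filler</div>" * 50_000
sample_html = (
    padding
    + "<span>Best Sellers Rank: #43,847 in Kindle Store</span>"
    + padding
)

start = time.perf_counter()
rank = parse_kindle_rank(sample_html)
elapsed = time.perf_counter() - start

print(rank)  # 43847
print(f"parsing took {elapsed:.4f} s")
```

On most machines the parsing step should finish in a small fraction of the time a network round trip takes, so caching or parallelising the requests is likely to help more than tuning the text extraction. Keeping the parser as a pure function of the HTML string also makes it easy to test without network access.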

Answered By: RobinFrcd