Scraping Book rank from Amazon Book page — optimizing code
Question:
I developed a function to get the book rank from an Amazon book page, but I’m not entirely satisfied with it. It would be great to know if this can be optimised to collect the rank in a more efficient way (by efficient I was thinking perhaps it’s possible to use something like "if string contains", though I have not been successful here). Please find the code below:
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

def scrap_rank_amz(link):
    # Step 1 — Get URL and content
    url = link
    request = Request(url, headers={"User-agent": "Mozilla/5.0"})
    html = urlopen(request)
    # Step 2 — Create BeautifulSoup instance to get elements from the HTML
    soup = BeautifulSoup(html, "html.parser")
    # Step 3 — Collect the info from the block where we will find the rank
    soup_rank = soup.find_all(
        class_="a-unordered-list a-nostyle a-vertical a-spacing-none detail-bullet-list",
        limit=2,
    )
    # Step 4 — Obtain specifically the items where we'll find the rank
    soup_rank_detail = [i.find(class_="a-list-item").get_text(" ") for i in soup_rank]
    # Step 5 — Obtain the rank
    soup_rank_detail_lv2 = soup_rank_detail[1][24:32]
    # Step 6 — Return rank value
    return soup_rank_detail_lv2
You can find an example of a link to be used as follows: https://www.amazon.com/Moonshine-Magic-Southern-Charms-Mystery-ebook/dp/B078SZLXB3
Thanks a lot for your time!
Sara
Answers:
Assuming all the pages you want to scrape have a Kindle ranking, you can use a simple regexp.
import re
from typing import Optional
from urllib.request import urlopen, Request

def get_kindle_rank(link: str) -> Optional[str]:
    request = Request(link, headers={"User-agent": "Mozilla/5.0"})
    html = urlopen(request).read().decode()
    # Capture the digits (and thousands separators) between "#" and " in Kindle"
    regexp_match = re.search(r"#([\d,]+) in Kindle", html)
    if regexp_match:
        return regexp_match.group(1)
    return None

get_kindle_rank("https://www.amazon.com/Moonshine-Magic-Southern-Charms-Mystery-ebook/dp/B078SZLXB3")
# '43,847'
However, it won’t be much faster, as most of the runtime is spent on the network request itself rather than on text parsing.
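As a side note, the "if string contains" idea from the question can work too: the rank can be located with plain string methods instead of a regexp. This is only a sketch, and the extract_kindle_rank helper and the sample string are illustrative, not part of the original code; it assumes the page text contains the literal pattern "#<digits> in Kindle".

```python
from typing import Optional

def extract_kindle_rank(html: str) -> Optional[str]:
    """Pull the Kindle rank out of raw page text using plain string methods."""
    marker = " in Kindle"
    idx = html.find(marker)  # the "if string contains" check
    if idx == -1:
        return None
    # Walk backwards from the marker to the "#" that starts the rank.
    start = html.rfind("#", 0, idx)
    if start == -1:
        return None
    return html[start + 1:idx]

# Example on a snippet of page text:
sample = "Best Sellers Rank: #43,847 in Kindle Store (See Top 100)"
extract_kindle_rank(sample)  # '43,847'
```

It behaves the same as the regexp version on this input, but the regexp is more concise and easier to adapt if the surrounding text changes.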