BeautifulSoup scrape – Fail to retrieve product list

Question

I am reaching out because I am having trouble fixing a piece of code that is supposed to scrape some information from Amazon product pages (title, URL, product name, etc.). Classic stuff for scraping practice 🙂

So I basically wrote two functions:

One function to generate the URL to scrape
One function to walk through the different elements and extract the values

At the end I just start my driver and BeautifulSoup and call the two functions.

However, the outcome is not what I expect. I'd like to get an organized CSV file with one row per product and each piece of associated information in its own column. Instead, I always end up with 1 or 2 rows, not all products from all pages.

I assume the problem comes from my soup or from the for loop not going through all items properly (though I can't figure out what exactly).

I'd like to get your opinion on this. Do you have any clues?

Thank you very much for the help

from bs4 import BeautifulSoup
from selenium import webdriver
import csv

#Function to generate URL with search KW & page nb
def get_url(search_term,page):
    template = 'https://www.amazon.co.uk/s?k={}&page='+str(page)
    search_term = search_term.replace(' ','+')
    url = template.format(search_term)

    return url

#Function to retrieve all data from the page
def extract_record(item):
    atag = item.h2.a
    
    #Retrieve product name
    description = atag.text.strip()
    
    #Retrieve product URL
    url = 'https://www.amazon.co.uk' + atag.get('href')
    
    #Retrieve sponsored status
    try:
        sponso_parent = item.find('span','s-label-popover-default')
        sponso = sponso_parent.find('span', {'class': 'a-size-mini a-color-secondary', 'dir': 'auto'}).text
    except AttributeError:
        sponso = 'No' 

    #Retrieve price info
    try:
        price_parent = item.find('span','a-price')
        price = price_parent.find('span','a-offscreen').text
    except AttributeError:
        return
    
    #Retrieve avg product rating
    try:
        rating = item.i.text
    except AttributeError:
        rating = ''
    
    #Retrieve review count (if monetary value, nill it due to missing value)
    try:
        review_count = item.find('span', {'class': 'a-size-base', 'dir': 'auto'}).text
    except AttributeError:
        review_count = ''
    
    if "£" in review_count or "€" in review_count or "$" in review_count:
        review_count = 0
    
    result = (url, description, sponso, price, rating,  review_count)
    
    return result
        
record_final = []

#Loop through page nb
for page in range(1,3):
    url = get_url('laptop',page)
    print(url)
    
    #Instantiate web driver & retrieve page content with BS (then loop through every product)
    driver = webdriver.Chrome(r'\Users\rapha\Desktop\chromedriver.exe')
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    final_soup = soup.find_all('div',{'data-component-type': 's-search-result'})
    
    try:
        for item in final_soup:
            record = extract_record(item)
            if record:
                record_final.append(record)
    except AttributeError:
        print('error_record')
    
    driver.close()

with open('resultsamz.csv','w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'description', 'sponso', 'price', 'rating','review_count'])
    writer.writerow(record_final)

Asked By: Raphaël Ambit


Answer 1

You have to iterate over your final list of records and write each item as its own row.

Change this:

writer.writerow(record_final)

To this:

for item in record_final:
    writer.writerow(item)
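
To see why this matters, here is a minimal sketch that writes the same records both ways into an in-memory buffer. The sample records, the `example.com` URLs and the `to_csv` helper are made up for illustration; only `csv.writer`, `writerow` and `writerows` come from the standard library. `writerows` should be equivalent to the loop above.

```python
import csv
import io

# Made-up sample records in the same shape as the tuples built by extract_record
record_final = [
    ('https://example.com/a', 'Laptop A', 'No', '£119.99', '3.4 out of 5 stars', '21'),
    ('https://example.com/b', 'Laptop B', 'No', '£109.99', '3.3 out of 5 stars', '111'),
]
header = ['url', 'description', 'sponso', 'price', 'rating', 'review_count']

def to_csv(records, one_row_per_record):
    """Hypothetical helper: write header + records to a buffer, return the lines."""
    buffer = io.StringIO(newline='')
    writer = csv.writer(buffer)
    writer.writerow(header)
    if one_row_per_record:
        # One CSV row per record, same effect as looping with writerow(item)
        writer.writerows(records)
    else:
        # The whole list becomes ONE row; each tuple is stringified into a single cell
        writer.writerow(records)
    return buffer.getvalue().splitlines()

wrong = to_csv(record_final, one_row_per_record=False)
right = to_csv(record_final, one_row_per_record=True)

print(len(wrong))  # header + a single data row
print(len(right))  # header + one row per record
```

With `writerow(record_final)` the file only ever contains the header and one data row, which would explain ending up with 1 or 2 rows regardless of how many products were scraped.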

Answered By: Arthur Pereira

Answer 2

Your code is doing what you have told it to do.

# Retrieve review count (if monetary value, nill it due to missing value)

This is what you are getting:

('https://www.amazon.co.uk/G-Anica%C2%AE-Portable-Ultrabook-Earphone-Accessories/dp/B08FCFDPVF/ref=sr_1_10?dchild=1&keywords=laptop&qid=1606453924&sr=8-10', 'G-Anica® Netbook Laptop PC 10 inch Android Portable Ultrabook,Dual Core, Wifi,with Laptop Bag + Mouse + Mouse Pad + Earphone (4 PCS Computer Accessories) (Pink)', 'No', '£119.99', '3.4 out of 5 stars', '21')
('https://www.amazon.co.uk/CHERRY%C2%AE-Notebook-Netbook-Computer-Keyboard/dp/B07ZPW7R14/ref=sr_1_11?dchild=1&keywords=laptop&qid=1606453924&sr=8-11', 'FANCY CHERRY® NEW 2018 HD 10 inch Mini Laptop Notebook Netbook Tablet Computer 1G DDR3 8GB Memory VIA WM8880 CPU Dual Core Android Screen Wifi Camera Keyboard USB HDMI (Black 8GB)', 'No', '£109.99', '3.3 out of 5 stars', '111')
None
None
None
None
None
https://www.amazon.co.uk/s?k=laptop&page=2

Now, if you visit the page, many laptops are listed without a price. Your code skips those, as you told it to: extract_record returns None whenever the price lookup raises AttributeError.
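
If you would rather keep the products that have no price, one option is to fall back to an empty string instead of returning early. The following is only a sketch under assumptions: `SAMPLE_HTML` is hand-written markup that imitates two search results (it is not real Amazon output), and `extract_price` and `extract_all` are hypothetical helper names. The selectors are the ones from the question.

```python
from bs4 import BeautifulSoup

# Hand-written markup imitating two search results: one with a price, one without
SAMPLE_HTML = """
<div data-component-type="s-search-result">
  <h2><a href="/dp/AAA">Laptop A</a></h2>
  <span class="a-price"><span class="a-offscreen">£119.99</span></span>
</div>
<div data-component-type="s-search-result">
  <h2><a href="/dp/BBB">Laptop B</a></h2>
</div>
"""

def extract_price(item):
    """Return the price text, or '' when the result has no price element."""
    try:
        return item.find('span', 'a-price').find('span', 'a-offscreen').text
    except AttributeError:
        # find() returned None somewhere: keep the product, leave the price blank
        return ''

def extract_all(html):
    """Return (description, price) for every search result in the markup."""
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('div', {'data-component-type': 's-search-result'})
    return [(item.h2.a.text.strip(), extract_price(item)) for item in items]

rows = extract_all(SAMPLE_HTML)
for row in rows:
    print(row)
```

Both products are kept here, the second with an empty price. Whether a blank price is acceptable in the CSV is a choice that depends on what you want to do with the data afterwards.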

Answered By: Abhishek Rai
