How to get all the products from all pages in the subcategory (python, amazon)

Question:

How can I get all the products from all the pages in the subcategory? I have attached my program below. Right now it only gets products from the first page. I would like to get all the products from that subcategory across all 400+ pages: go to the next page, extract all the products, move to the next page, and so on. I will appreciate any help.

# selenium imports
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import random

PROXY = "88.157.149.250:8080"


chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s' % PROXY)
# //a[starts-with(@href, 'https://www.amazon.com/')]/@href
LINKS_XPATH = '//*[contains(@id,"result")]/div/div[3]/div[1]/a'
browser = webdriver.Chrome(chrome_options=chrome_options)
browser.get(
    'https://www.amazon.com/s/ref=lp_11444071011_nr_p_8_1/132-3636705-4291947?rh=n%3A3375251%2Cn%3A%213375301%2Cn%3A10971181011%2Cn%3A11444071011%2Cp_8%3A2229059011')
links = browser.find_elements_by_xpath(LINKS_XPATH)
for link in links:
    href = link.get_attribute('href')
    print(href)

Asked By: ryy77


Answers:

# selenium imports
import random
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import time


def list_all_items():
    # items = browser.find_elements_by_css_selector('.a-size-base.s-inline.s-access-title.a-text-normal')
    print("Start")
    item_list = []
    items = WebDriverWait(browser, 60).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".a-size-base.s-inline.s-access-title.a-text-normal")))
    print("items--->", items)
    if items:
        for item in items:
            print(item.text, "\n\n")
            item_list.append(item.text)
    # time.sleep(3)
    # next_button = WebDriverWait(browser, 60).until(EC.presence_of_element_located((By.ID, 'pagnNextString')))
    next_button = WebDriverWait(browser, 60).until(EC.element_to_be_clickable((By.ID, "pagnNextString")))
    print("next_button-->", next_button)
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    print("____________SCROLL_DONE___")
    next_button.click()
    print("Click_done")
    # recurse to handle the next page (no exit condition here; see the note below the code)
    list_all_items()
#     next_button = browser.find_element_by_id('pagnNextString')
#     next_button.click()

# ifpagnNextString
# https://www.amazon.com/s/ref=lp_11444071011_nr_p_8_1/132-3636705-4291947?rh=n%3A3375251%2Cn%3A%213375301%2Cn%3A10971181011%2Cn%3A11444071011%2Cp_8%3A2229059011


PROXY = "88.157.149.250:8080"

chrome_options = webdriver.ChromeOptions()
# chrome_options.add_argument('--proxy-server=%s' % PROXY)
# //a[starts-with(@href, 'https://www.amazon.com/')]/@href
LINKS_XPATH = '//*[contains(@id,"result")]/div/div[3]/div[1]/a'
browser = webdriver.Chrome(chrome_options=chrome_options)
browser.maximize_window()
browser.get('https://www.amazon.com/s/ref=lp_11444071011_nr_p_8_1/132-3636705-4291947?rh=n%3A3375251%2Cn%3A%213375301%2Cn%3A10971181011%2Cn%3A11444071011%2Cp_8%3A2229059011')

list_all_items()

I have made a method that prints the list of items from each page and calls itself recursively; at the end of the method it clicks on the Next button. I did not add a break/exit condition, but I believe you can manage that. The "list_all_items" method is the logic that does what you require.

Also, uncomment the proxy part that I have commented out if you need it.
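For the missing exit condition, one possibility (just a sketch, not part of the original answer) is to stop recursing when no clickable Next button appears within the wait period, by catching Selenium's TimeoutException:

from selenium.common.exceptions import TimeoutException

def list_all_items():
    # ... collect and print the items on the current page, as above ...
    try:
        # assumes the same "pagnNextString" id used in the code above
        next_button = WebDriverWait(browser, 10).until(
            EC.element_to_be_clickable((By.ID, "pagnNextString")))
    except TimeoutException:
        return  # no Next button found: last page reached, stop the recursion
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    next_button.click()
    list_all_items()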

Answered By: Sachhya

Let me break this problem up into a few steps, so you understand what needs to be done here.

First of all, you need to get all the products from a page.

Then, you need to get all the pages and repeat the first step on each and every page.

Now, I do not know Python, so I will try to do this in as generic a way as I can.

First, you need to create an int with value 0.
After that you need to get the number of pages. To do so, check:

numberOfPagesString = browser.find_element_by_xpath("//span[@class='pagnDisabled']").text

numberOfPages = int(numberOfPagesString)

i = 0

Then you need to create a loop. In the loop, you are going to increment the int you initialized to 0, up to the maximum number of pages (about 400 here).

So your loop, as long as the int is NOT yet equal to the number of pages, is going to get all the products on the page, click on the next page link, and do whatever you want with them. This will result in something like:

while i < numberOfPages   # as long as the value of i is less than the number of pages (about 400 here), do this loop

    # code to get all products on the page goes here

    # click on the next page link
    browser.find_element_by_id('pagnNextString').click()

    i++   # here your i will become 1 after the first page, 2 after the second, etc.

So to conclude: the first thing you do is determine how many pages there are for this subcategory.

Then you are going to create an int from the string you get back from the browser.

Then you create an int with value 0, which you use on every iteration of the loop to check whether you have reached the maximum number of pages.

After that you are going to get all the products from the page first (if you do not do that, it is going to skip the first page).

And at last, it's going to click on the next page button.

To finish it, your int i gets an increment with ++, so after every loop it increases by 1.
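A rough Python translation of the above (just a sketch; the pagnDisabled and pagnNextString ids and the result-link XPath are taken from the question and the snippet above, and may need adjusting):

# sketch of the loop described above; assumes `browser` is the webdriver created in the question
LINKS_XPATH = '//*[contains(@id,"result")]/div/div[3]/div[1]/a'

numberOfPagesString = browser.find_element_by_xpath("//span[@class='pagnDisabled']").text
numberOfPages = int(numberOfPagesString)
i = 0

while i < numberOfPages:
    # get all the products on the current page first, so the first page is not skipped
    for link in browser.find_elements_by_xpath(LINKS_XPATH):
        print(link.get_attribute('href'))

    # click on the next page link
    browser.find_element_by_id('pagnNextString').click()

    i += 1  # Python has no ++ operator, so i += 1 is the equivalent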

Answered By: Anand

As you want to get a huge amount of data, it's better to get it with direct HTTP requests instead of navigating to each page with Selenium.

Try iterating through all the pages and scraping the required data as below:

import requests
from lxml import html

page_counter = 1
links = []

while True:
    headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0"}
    url = "https://www.amazon.com/s/ref=sr_pg_{0}?rh=n%3A3375251%2Cn%3A!3375301%2Cn%3A10971181011%2Cn%3A11444071011%2Cp_8%3A2229059011&page={0}&ie=UTF8&qid=1517398836".format(page_counter)
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        source = html.fromstring(response.content)
        links.extend(source.xpath('//*[contains(@id,"result")]/div/div[3]/div[1]/a/@href'))
        page_counter += 1
    else:
        break

print(links)

P.S. Check this ticket to be able to use a proxy with the requests library.
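For example, a minimal sketch of passing a proxy to requests (reusing the proxy address from the question; adjust the scheme and credentials as needed):

# route the request through an HTTP proxy; the address is taken from the question
proxies = {
    "http": "http://88.157.149.250:8080",
    "https": "http://88.157.149.250:8080",
}
response = requests.get(url, headers=headers, proxies=proxies)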

Answered By: Andersson