Scrape web content from Amazon

Question:

I’m trying to scrape Amazon prices with PhantomJS and Python. I want to parse the page with BeautifulSoup to get the new and used prices for books. The problem: when I parse the source of the page that PhantomJS fetches, the prices are just 0,00. The code below is a simple test.

I don’t understand whether Amazon has measures in place to block price scraping or whether I’m doing something wrong, because I’ve tried other, simpler pages and can get the data I want.

PS: I’m in a country that isn’t supported by the Amazon API, which is why the scraper is necessary.

from selenium import webdriver
from bs4 import BeautifulSoup

link = 'http://www.amazon.com/gp/offer-listing/1119998956/ref=dp_olp_new?ie=UTF8&condition=new'#'http://www.amazon.com/gp/product/1119998956'

class AmazonScraper(object):
    def __init__(self):
        self.driver = webdriver.PhantomJS()
        self.driver.set_window_size(1120, 550)

    def scrape_prices(self):
        self.driver.get(link)
        s = BeautifulSoup(self.driver.page_source, "html.parser")
        return s

    def scrape(self):
        source = self.scrape_prices()
        print source
        self.driver.quit()

if __name__ == '__main__':
    scraper = AmazonScraper()
    scraper.scrape()
Asked By: mch505


Answers:

First of all, following @Nick Bailey’s comment, study Amazon’s Terms of Use and make sure you are not violating them.

To solve it, you need to tweak the PhantomJS desired capabilities and present a real browser user agent:

# Pretend to be a regular desktop browser; the default PhantomJS
# user agent is what gets served the zeroed-out prices.
caps = webdriver.DesiredCapabilities.PHANTOMJS
caps["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 (KHTML, like Gecko) Chrome/15.0.87"

self.driver = webdriver.PhantomJS(desired_capabilities=caps)
self.driver.maximize_window()

And, to make it bullet-proof, you can define a custom expected condition and wait for the price to become non-zero:

from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class wait_for_price(object):
    """Custom expected condition: true once the element's text is not "0,00"."""
    def __init__(self, locator):
        self.locator = locator

    def __call__(self, driver):
        try:
            # EC._find_element is a private Selenium helper that locates
            # the element via the given locator.
            element_text = EC._find_element(driver, self.locator).text.strip()
            return element_text != "0,00"
        except StaleElementReferenceException:
            return False

Usage:

def scrape_prices(self):
    self.driver.get(link)

    # Block for up to 200 seconds until at least one offer price is non-zero
    WebDriverWait(self.driver, 200).until(wait_for_price((By.CLASS_NAME, "olpOfferPrice")))
    s = BeautifulSoup(self.driver.page_source, "html.parser")

    return s
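
Once the wait succeeds, pulling the prices out of the returned soup is straightforward. A minimal sketch, assuming the offers are still marked with the olpOfferPrice class used in the wait above (extract_prices is a hypothetical helper, not part of Selenium or BeautifulSoup):

# Hypothetical helper: collect the offer price strings from the parsed page.
# Assumes each offer's price is in an element with class "olpOfferPrice",
# as in the wait condition above.
def extract_prices(soup):
    return [tag.get_text().strip() for tag in soup.find_all(class_="olpOfferPrice")]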
Answered By: alecxe

Good answer on setting the PhantomJS user agent to that of a normal browser. Since you said your country isn’t supported by Amazon, I would imagine you also need to set a proxy.

Here is an example of how to start PhantomJS in Python with a Firefox user agent and a proxy:

from selenium.webdriver import PhantomJS
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Route all traffic through an authenticated proxy
service_args = ['--proxy=1.1.1.1:port', '--proxy-auth=username:pass']

# Spoof a Firefox user agent
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:36.0) Gecko/20100101 Firefox/36.0"

driver = PhantomJS(desired_capabilities=dcap, service_args=service_args)

where 1.1.1.1 is your proxy IP and port is the proxy port. The username and password are only necessary if your proxy requires authentication.
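
To confirm that both settings took effect before scraping, you can point the driver at an echo service. A quick sketch using httpbin.org (an assumption; any service that echoes your IP and headers works):

# Sanity check: httpbin echoes the requesting IP and user agent, so the
# responses should show the proxy's IP and the spoofed Firefox UA.
driver.get('https://httpbin.org/ip')
print driver.page_source          # should show the proxy IP, not yours
driver.get('https://httpbin.org/user-agent')
print driver.page_source          # should show the Firefox user agent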

Answered By: Ryan Hovey

Another framework to try is Scrapy. It is simpler than Selenium, which is designed to simulate browser interactions. Scrapy gives you classes for easily parsing data using CSS selectors or XPath, and a pipeline to store that data in whatever format you’d like, such as writing it to a MongoDB database.

Often you can write a fully built spider and deploy it to Scrapy Cloud in under 10 lines of code.
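
A minimal sketch of such a spider, assuming the olpOfferPrice class from the answers above still marks the prices (the spider name, start URL, and selector are illustrative):

import scrapy

class AmazonPriceSpider(scrapy.Spider):
    # Illustrative spider: name, start URL, and selector are assumptions
    name = "amazon_prices"
    start_urls = [
        "http://www.amazon.com/gp/offer-listing/1119998956/ref=dp_olp_new?ie=UTF8&condition=new"
    ]

    def parse(self, response):
        # Yield one item per offer price found via a CSS class selector
        for price in response.css(".olpOfferPrice::text").extract():
            yield {"price": price.strip()}

Note that Scrapy does not drive a browser, so the user-agent advice above still applies; you can set USER_AGENT in the project settings.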

Check out this YT video on how to use Scrapy for scraping Amazon reviews as a use case.

Answered By: Chris Varriale