How to use scrapy for Amazon.com links after "Next" Button?

Question:

I am relatively new to Python and Scrapy. I’m trying to scrape the links in “Customers who bought this item also bought”.
For example: http://www.amazon.com/Confessions-Economic-Hit-John-Perkins-ebook/dp/B001AFF266/. There are 17 pages for “Customers who bought this item also bought”. If I ask Scrapy to scrape that URL, it only scrapes the first page (6 items). How do I ask Scrapy to press the “Next” button so that it scrapes all the items in the 17 pages? Sample code (just the part that matters in crawler.py) would be greatly appreciated. Thank you for your time!

OK, here is my code. As I said, I am new to Python, so the code might look quite stupid, but it works for scraping the first page (6 items). I work mostly with Fortran or Matlab. I would love to learn Python systematically if I have time, though.

# Code of my crawler.py:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from beta.items import BetaItem

class AlphaSpider(CrawlSpider):

    name = 'alpha'
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com/s/ref=lp_4366_nr_p_n_publication_date_0?rh=n%3A283155%2Cn%3A%211000%2Cn%3A4366%2Cp_n_publication_date%3A1250226011&bbn=4366&ie=UTF8&qid=1384729756&rnid=1250225011']
    rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('//h3/a',)), callback='parse_item'), )

    def parse_item(self, response):
        sel = Selector(response)

        stuff = BetaItem()

        # ISBN-10: keep the second space-separated token of the text after the label
        isbn10R = sel.xpath('//li[b[contains(text(),"ISBN-10:")]]/text()').extract()
        isbn10 = []
        if len(isbn10R) > 0:
            isbn10 = [(isbn10R[0].split(' '))[1]]
        stuff['isbn10'] = isbn10

        # Average rating: keep the first token of the title attribute
        starsR = sel.xpath('//div[contains(@id,"averageCustomerReviews")]/span/@title').extract()
        stars = []
        if len(starsR) > 0:
            stars = [(starsR[0].split(' '))[0]]
        stuff['stars'] = stars

        # Number of reviews: keep the first token of the link text
        reviewsR = sel.xpath('//div[contains(@id,"averageCustomerReviews")]/a[contains(@href,"showViewpoints=1")]/text()').extract()
        reviews = []
        if len(reviewsR) > 0:
            reviews = [(reviewsR[0].split(' '))[0]]
        stuff['reviews'] = reviews

        # "Customers who bought this item also bought": the ASIN sits between 'dp/' and '/ref'
        copsR = sel.xpath('//a[@class="sim-img-title"]/@href').extract()
        cops = [(cop.split('dp/'))[1].split('/ref')[0] for cop in copsR]
        stuff['cops'] = cops

        return stuff
Asked By: maxwell


Answers:

I would recommend avoiding Scrapy, especially since you’re a beginner.
Use the excellent Requests module for downloading pages:
https://github.com/kennethreitz/requests

and BeautifulSoup for parsing web pages:
http://www.crummy.com/software/BeautifulSoup/

Answered By: Goranek

So I understand you were able to scrape these “Customers Who Bought This Item Also Bought” product details. As you probably saw, these are within a ul in a div with class “shoveler-content”:

<div id="purchaseButtonWrapper" class="shoveler-button-wrapper">
    <a class="back-button" onclick="return false;" style="" href="#Back">
    <div class="shoveler-content">
        <ul tabindex="-1">
            <li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">
                <div id="purchase_B003LSTK8G" class="new-faceout p13nimp" data-ref="pd_sim_kstore_1" data-asin="B003LSTK8G">
                ...
                </div>
            </li>
            <li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
            <li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
            <li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
            <li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
            <li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
        </ul>
    </div>
    <a class="next-button" onclick="return false;" style="" href="#Next">
        <span class="auiTestSprite s_shvlNext">...</span>
    </a>
    </div>
</div>

If you inspect your browser’s network activity (via Firebug or Chrome’s Inspect tool) while clicking the “next” button for more suggested products, you’ll see an AJAX query to a URL of this sort:

http://www.amazon.com
    /gp/product/features/similarities/shoveler/cell-render.html/ref=pd_sim_kstore?
    id=B00261OOWQ,B003XQEVUI,B001NLL5WC,B000FC1KZC,B005G5PPGS,B0043RSJB8,
    B004TSBWYC,B000RH0C8G,B0035IID08,B002AQRVXQ,B005DIAUN6,B000FC10QG
    &pos=7&refTag=pd_sim_kstore&wdg=ebooks_display_on_website
    &shovelerName=purchase

(I’m using this product page: http://www.amazon.com/Boomerang-Travels-New-Third-World-ebook/dp/B005CRQ2OE)

The id query argument holds a list of ASINs, which are the next suggested products. Why 12 ASINs when only 6 are displayed? Probably some in-page caching for the next “next” click the user is likely to make.
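As a rough sketch, the query URL above could be rebuilt from a list of ASINs with the standard library. The function name `build_shoveler_url` is made up for illustration, and the fixed parameter values are simply copied from the request observed above; other product pages may use different values.

```python
from urllib.parse import urlencode

# Endpoint path copied from the AJAX request observed in the browser.
SHOVELER_URL = ("http://www.amazon.com/gp/product/features/similarities/"
                "shoveler/cell-render.html/ref=pd_sim_kstore")


def build_shoveler_url(asins, pos):
    """Return the AJAX URL for a batch of ASINs starting at position `pos`.

    Hypothetical helper: the parameter values below come from the single
    example request shown above and are assumptions for any other page.
    """
    params = [
        ("id", ",".join(asins)),
        ("pos", pos),
        ("refTag", "pd_sim_kstore"),
        ("wdg", "ebooks_display_on_website"),
        ("shovelerName", "purchase"),
    ]
    # safe="," leaves the commas between ASINs unescaped, as in the observed URL.
    return SHOVELER_URL + "?" + urlencode(params, safe=",")
```

For example, `build_shoveler_url(["B00261OOWQ", "B003XQEVUI"], 7)` yields a URL whose query string starts with `id=B00261OOWQ,B003XQEVUI&pos=7`.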

What do you get back from this AJAX query? Still within your browser’s inspect tool, you’ll see that the response is of type application/json, and that the response data is a JSON array of 12 elements, each element being an HTML snippet similar to:

<div class="new-faceout p13nimp" id="purchase_B00261OOWQ" data-asin="B00261OOWQ" data-ref="pd_sim_kstore_7">
    <a href="/Home-Game-Accidental-Guide-Fatherhood-ebook/dp/B00261OOWQ/ref=pd_sim_kstore_7" class="sim-img-title" >
        <div class="product-image">
            <img src="http://ecx.images-amazon.com/images/I/51ZBpvGgsUL._SL500_PIsitb-sticker-arrow-big,TopRight,35,-73_OU01_SS100_.jpg" width="100" alt="" height="100" border="0" /> 
        </div> Home Game: An Accidental Guide to Fatherhood
    </a> 
    <div class="byline">
        <span class="carat">&#8250</span> 
        <a href="http://www.amazon.com/Michael-Lewis/e/B000APZ33E/ref=pd_sim_kstore_bl_7">Michael Lewis</a> 
    </div> 

    <div class="rating-price"> 
        <span class="rating-stars">
            <span class="crAvgStars" style="white-space:no-wrap;">
                <span class="asinReviewsSummary" name="B00261OOWQ">
                    <a href="http://www.amazon.com/Home-Game-Accidental-Guide-Fatherhood-ebook/product-reviews/B00261OOWQ/ref=pd_sim_kstore_cm_cr_acr_img_7">
                        <span class="auiTestSprite s_star_4_0 " title="4.1 out of 5 stars" >
                            <span>4.1 out of 5 stars</span>
                        </span>
                    </a>&nbsp;
                </span>
                (<a href="http://www.amazon.com/Home-Game-Accidental-Guide-Fatherhood-ebook/product-reviews/B00261OOWQ/ref=pd_sim_kstore_cm_cr_acr_txt_7">99</a>)
            </span>
        </span> 
    </div> 
    <div class="binding-platform"> Kindle Edition </div> 
    <div class="pricetext"><span class="price" style="margin-right:5px">$11.36</span></div> 
</div>

So you basically get the same markup that the original page had in its suggested-products section, one snippet per <li> in <div class="shoveler-content"><ul>.
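A minimal sketch of decoding such a response, assuming the body really is a JSON array of HTML strings shaped like the snippet above. It uses only the standard library. The names `FaceoutParser` and `parse_shoveler_response` are hypothetical, and in a Scrapy callback you could use a Selector on each snippet instead.

```python
import json
from html.parser import HTMLParser


class FaceoutParser(HTMLParser):
    """Collect the ASIN and the product link from one suggestion snippet."""

    def __init__(self):
        super().__init__()
        self.asin = None
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and "data-asin" in attrs:
            # The outer <div> carries the ASIN in its data-asin attribute.
            self.asin = attrs["data-asin"]
        elif tag == "a" and attrs.get("class") == "sim-img-title":
            # The title link points to the suggested product page.
            self.href = attrs.get("href")


def parse_shoveler_response(body):
    """Decode the JSON array and return one (asin, href) pair per snippet."""
    results = []
    for snippet in json.loads(body):
        parser = FaceoutParser()
        parser.feed(snippet)
        parser.close()
        results.append((parser.asin, parser.href))
    return results
```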

But how do you get the ASIN codes to put in the AJAX query’s id parameter?

Well, in the product page, you’ll notice this section:

<div id="purchaseSimsData" 
    class="sims-data" style="display:none" 
    data-baseAsin="B005CRQ2OE" data-featureId="pd_sim" 
    data-pageId="B005CRQ2OEr_sim_2" data-reftag="pd_sim_kstore"
    data-wdg="ebooks_display_on_website" data-widgetName="purchase">
    B003LSTK8G,B000VKVZR6,B003E20ZRY,B000RH0C9A,B000RH0CA4,B000YMDQRS,
    B00261OOWQ,B003XQEVUI,B001NLL5WC,B000FC1KZC,B005G5PPGS,B0043RSJB8,
    B004TSBWYC,B000RH0C8G,B0035IID08,B002AQRVXQ,B005DIAUN6,B000FC10QG,
    B0018QQQKS,B002OTKEP6,B005PUWUKS,B007V65R54,B00B3VOTTI,B004EYT932,
    B002UBRFFU,B000WJSB50,B000RH0DYE,B004JXXKWY,B003E8AJXI,B008TRU7PE,
    B00555X8OA,B007OSIOWM,B00DLJIA54,B00139XTG4,B0058Z4NR8,B00ALBR6JG,
    B004H0M8QS,B003F3PL7Q,B008UX8YPC,B000U913GG,B003HOXLVQ,B000VWM0MI,
    B000SEIU28,B006VE7YS0,B008KPMBIG,B003CIQ57E,B0064EHZY0,B008UX3ITE,
    B001NLKY38,B003VIWK4C,B005GSYZRA,B007YGGOVM,B004H4X84K,B00B5ZQ72Y,
    B000R1BAH4,B008W02TIG,B000W8HC8I,B0036QVOKU,B000VRBBDC,B00APDGFOC,
    B00EOAS0EK,B000QCS888,B001QIGZEK,B0074B55IK,B000FC12C8,B00AP2XVJ0,
    B000FCK5YE,B006ID6UAW,B001FA0W5W,B005HFI0X2,B006ZOYM9K,B003SNJZ3Y,
    B00C1N5WOI,B008EKORIY,B00C4GRK4W,B004V3WRNU,B00BV6RTUG,B001AFF266,
    B00DUM1W3E,B00APDGGCS,B008WOUFIS,B008EKOO46,B008JHXO6S,B005AJM3U6,
    B00BKRW6GI,B00CDUVSQ0,B00A287PG2,B009H679WA,B000VDUWMC,B009NF6IRW
</div>

which looks like the ASINs of all the suggested products.
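Turning that block of text into a Python list could look like the sketch below. The helper name `split_asins` is made up, and the XPath in the comment is an assumption based on the id shown above.

```python
def split_asins(raw_text):
    """Split the comma-separated text of the purchaseSimsData div into ASINs."""
    # Strip the whitespace and newlines around each code and drop empty pieces.
    return [asin.strip() for asin in raw_text.split(",") if asin.strip()]


# In a Scrapy callback, raw_text might come from something like (assumption):
#   "".join(sel.xpath('//div[@id="purchaseSimsData"]/text()').extract())
```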

Therefore, I suggest you emulate successive AJAX queries to get the suggested products, 12 ASINs at a time, decode each response using the json package, and then parse each HTML snippet to extract the product info you want.
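The "12 ASINs at a time" step can be sketched as a small generator. It assumes that the first 6 ASINs are already rendered in the page, that the first query therefore starts at pos=7 (as in the request observed above), and that pos then advances by the batch size. That last point is a guess and has not been verified against Amazon. The name `asin_batches` is hypothetical.

```python
def asin_batches(asins, shown=6, size=12):
    """Yield (pos, batch) pairs for the ASINs not already shown in the page.

    `shown` is the number of products rendered in the initial HTML and
    `size` the number of ASINs per AJAX query; both defaults are taken
    from the example above and are assumptions for other pages.
    """
    for start in range(shown, len(asins), size):
        # pos is 1-based: the first batch after 6 shown items starts at 7.
        yield start + 1, asins[start:start + size]
```

Each (pos, batch) pair would then feed one AJAX request, for instance via a Scrapy Request whose callback decodes the JSON body.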

Answered By: paul trmbrth