Python Program Scraping Different Text Despite Webpage Not Changing

Question:

This code attempts to scrape an Amazon listing to check its availability through the first party Amazon supplier.

from lxml import html
from time import sleep
import requests
import time

Amazonurl = raw_input("Item URL: ")

page = requests.get(Amazonurl)
tree = html.fromstring(page.text)

Stock = tree.xpath('//*[@id="merchant-info"]/text()')
IfInstock = ''.join(Stock)


if 'Ships from and sold by Amazon.com.' in IfInstock:
    print 'Instock'
    print time.strftime("%a, %d %b %Y %H:%M:%S")

else:
    print 'Not in Stock'
    print time.strftime("%a, %d %b %Y %H:%M:%S")

Oddly enough, when I plug in, say, http://www.amazon.com/New-Nintendo-3DS-XL-Black/dp/B00S1LRX3W/ref=sr_1_1?ie=UTF8&qid=1438413018&sr=8-1&keywords=new+3ds, which has not gone out of stock for the last few days, the code sometimes returns “Instock” and other times “Not in stock”. I found this to be because every so often the code scrapes

[]

while other times, it scrapes the following, as it should.

['\n    \n    \n\n    \n        \n        \n    \n    \n    \n    \n    \n    \n    \n    \n    \n    \n    \n    \n        Ships from and sold by Amazon.com.\n    \n    \n        \n        \n        \n        \n        \n        \n        Gift-wrap available.\n        \n\n']

The webpage does not seem to be changing. Does anyone know why my output often varies, and perhaps have an explanation of how I can fix this issue? Thanks in advance.

Asked By: The Novice


Answers:

Amazon is refusing to serve you this page.

I added one line to your script to see what the status_code of the response is when you get your odd outcome.

from lxml import html
from time import sleep
import requests
import time

Amazonurl = "http://www.amazon.com/dp/B00S1LRX3W/?tag=stackoverfl08-20"
intent = 0
while True:
    page = requests.get(Amazonurl)
    tree = html.fromstring(page.text)

    print(page.status_code)

    Stock = tree.xpath('//*[@id="merchant-info"]/text()')
    IfInstock = ''.join(Stock)

    if 'Ships from and sold by Amazon.com.' in IfInstock:
        print('Instock')
        print(time.strftime("%a, %d %b %Y %H:%M:%S"))

    else:
        print('Not in Stock')
        print(time.strftime("%a, %d %b %Y %H:%M:%S"))

    time.sleep(15)

    if intent>15:
        break
    intent += 1

I ran this script with a time interval of 15 seconds just as you said you did. Here’s the outcome:

200
Instock
Sat, 01 Aug 2015 19:51:27
200
Instock
Sat, 01 Aug 2015 19:51:43
503
Not in Stock
Sat, 01 Aug 2015 19:51:59
200
Instock
Sat, 01 Aug 2015 19:52:15
200
Instock
Sat, 01 Aug 2015 19:52:32
200
Instock
Sat, 01 Aug 2015 19:52:48
200
Instock
Sat, 01 Aug 2015 19:53:05
200
Instock
Sat, 01 Aug 2015 19:53:22
200
Instock
Sat, 01 Aug 2015 19:53:38
200
Instock
Sat, 01 Aug 2015 19:53:55
200
Instock
Sat, 01 Aug 2015 19:54:12
200
Instock
Sat, 01 Aug 2015 19:54:29
200
Instock
Sat, 01 Aug 2015 19:54:45
200
Instock
Sat, 01 Aug 2015 19:55:02
200
Instock
Sat, 01 Aug 2015 19:55:18
200
Instock
Sat, 01 Aug 2015 19:55:35
200
Instock
Sat, 01 Aug 2015 19:55:52

You can see that whenever you get the odd “Not in Stock” outcome, the status_code is 503. According to http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html, this status code is defined as follows:

10.5.4 503 Service Unavailable
The server is currently unable to handle the request due to a
temporary overloading or maintenance of the server. The implication is
that this is a temporary condition which will be alleviated after some
delay. If known, the length of the delay MAY be indicated in a
Retry-After header. If no Retry-After is given, the client SHOULD
handle the response as it would for a 500 response.

  Note: The existence of the 503 status code does not imply that a
  server must use it when becoming overloaded. Some servers may wish
  to simply refuse the connection.

That being said, Amazon is not serving you this page because you’re making several requests in a short time. That “short” interval isn’t actually very demanding for Amazon, which is why you get a 200 status_code most of the time.
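One way to make your script robust to this, as a sketch: check page.status_code before parsing, and on a 503 wait and retry, honoring the Retry-After header the RFC mentions if one is present (whether Amazon actually sends that header is not guaranteed; the function name here is just illustrative):

```python
import time
import requests

def fetch_with_retry(url, retries=3, default_delay=30):
    """GET a URL, retrying on 503 responses.

    Honors the Retry-After header when present; note this sketch
    assumes Retry-After is given in seconds (the spec also allows
    an HTTP date, which int() would not parse).
    """
    for attempt in range(retries):
        page = requests.get(url)
        if page.status_code == 200:
            return page
        if page.status_code == 503:
            # Wait the server-suggested delay, or a fixed fallback
            delay = int(page.headers.get("Retry-After", default_delay))
            time.sleep(delay)
        else:
            # Some other error (404, 500, ...): fail loudly
            page.raise_for_status()
    return None  # gave up after `retries` attempts
```

With this in place, a 503 no longer gets mis-reported as “Not in Stock” — the script either eventually gets a real 200 page to parse or returns None so you know the check was inconclusive.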

I hope that answers your question. Now, if you really want to scrape sites like Amazon, I would recommend using Scrapy, which is pretty easy to use and easy to configure. You can often get away with scraping sites like Amazon by using random user agents. But of course, this is just an add-on to your original question.
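The random user-agent idea can also be sketched with plain requests, without bringing in Scrapy (the agent strings and the helper name below are just illustrative):

```python
import random
import requests

# A small pool of illustrative desktop browser User-Agent strings;
# in practice you would use a larger, up-to-date list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/44.0.2403.89 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.12 "
    "(KHTML, like Gecko) Version/8.0.7 Safari/600.7.12",
    "Mozilla/5.0 (Windows NT 6.3; rv:39.0) Gecko/20100101 Firefox/39.0",
]

def get_with_random_agent(url):
    # Send a different User-Agent header on each request so that
    # consecutive requests look less like a single automated client
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers)
```

Note this only varies the header; it does not change your IP address or request rate, so it is no guarantee against 503s on its own.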

Answered By: gglasses