Add a delay after 500 requests in Scrapy

Question:

I have a list of 2000 start URLs and I’m using:

DOWNLOAD_DELAY = 0.25 

to control the speed of the requests, but I also want to add a bigger delay after every n requests.
For example, I want a delay of 0.25 seconds between requests and a delay of 100 seconds every 500 requests.

Edit:

Sample code:

import os
from os.path import join
import scrapy
import time

date = time.strftime("%d/%m/%Y").replace('/','_')

list_of_pages = {'http://www.lapatilla.com/site/':'la_patilla',                 
                 'http://runrun.es/':'runrunes',
                 'http://www.noticierodigital.com/':'noticiero_digital',
                 'http://www.eluniversal.com/':'el_universal',
                 'http://www.el-nacional.com/':'el_nacional',
                 'http://globovision.com/':'globovision',
                 'http://www.talcualdigital.com/':'talcualdigital',
                 'http://www.maduradas.com/':'maduradas',
                 'http://laiguana.tv/':'laiguana',
                 'http://www.aporrea.org/':'aporrea'}

root_dir = os.getcwd()
output_dir = join(root_dir,'data/',date)

class TestSpider(scrapy.Spider):
    name = "news_spider"
    download_delay = 1

    start_urls = list_of_pages.keys()

    def parse(self, response):
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)

        filename = list_of_pages[response.url]
        print(time.time())
        with open(join(output_dir,filename), 'wb') as f:
            f.write(response.body)

The list, in this case, is shorter, but the idea is the same: I want two levels of delay, one between each request and one every ‘N’ requests.
I’m not crawling the links, just saving the main page.

Answers:

You can look into using the AutoThrottle extension. It does not give you tight control over the delays; instead, it has its own algorithm for slowing down the spider, adjusting the delay on the fly based on response latency and the number of concurrent requests.
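If that trade-off is acceptable, enabling it only takes a few settings. A minimal sketch for settings.py (the numbers below are illustrative, not taken from the question):

# Rough AutoThrottle configuration; tune the values to your target sites.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 0.25        # initial download delay
AUTOTHROTTLE_MAX_DELAY = 60            # highest delay allowed on slow responses
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests per remote server
AUTOTHROTTLE_DEBUG = True              # log throttling stats for every response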

If you need more control over the delays at certain stages of the scraping process, you might need a custom middleware or a custom extension (similar to AutoThrottle – source).
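As a rough illustration of the middleware route (the class name and the PAUSE_EVERY / PAUSE_SECONDS settings are hypothetical, and a blocking time.sleep() pauses the whole crawl, not just one domain):

import time

class PauseEveryNRequestsMiddleware:
    """Hypothetical downloader middleware: sleep for a while every N requests."""

    def __init__(self, pause_every, pause_seconds):
        self.pause_every = pause_every
        self.pause_seconds = pause_seconds
        self.request_count = 0

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            pause_every=crawler.settings.getint('PAUSE_EVERY', 500),
            pause_seconds=crawler.settings.getfloat('PAUSE_SECONDS', 100.0),
        )

    def process_request(self, request, spider):
        self.request_count += 1
        if self.request_count % self.pause_every == 0:
            spider.logger.info('Pausing for %s seconds', self.pause_seconds)
            time.sleep(self.pause_seconds)  # blocks the Twisted reactor while sleeping
        return None  # let Scrapy continue downloading the request

You would then enable it in settings.py via DOWNLOADER_MIDDLEWARES, e.g. {'myproject.middlewares.PauseEveryNRequestsMiddleware': 543} (module path and priority are placeholders).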

You can also change the .download_delay attribute of your spider on the fly. Incidentally, this is essentially what the AutoThrottle extension does under the hood – it adjusts the delay value as the crawl runs.


Answered By: alecxe

Here’s a sleepy decorator I wrote that pauses after N function calls.

from time import sleep

def sleepy(f):
    def wrapped(*args, **kwargs):
        wrapped.calls += 1
        print(f"{f.__name__} called {wrapped.calls} times")
        if wrapped.calls % 500 == 0:
            print("Sleeping...")
            sleep(20)
        return f(*args, **kwargs)

    wrapped.calls = 0

    return wrapped
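One way to use it (a sketch, not necessarily how the answer intended it) is to wrap the spider’s parse callback, reusing the spider and list_of_pages from the question; keep in mind that the blocking sleep stops the whole reactor while it waits:

class TestSpider(scrapy.Spider):
    name = "news_spider"
    start_urls = list(list_of_pages.keys())

    @sleepy
    def parse(self, response):
        # every 500th call pauses before the parsing logic runs
        ...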
Answered By: Martin H