
Scrapy - set delay to retry middleware


One way would be to add a middleware to your Spider (source, linked):

```python
# File: middlewares.py
from twisted.internet import reactor
from twisted.internet.defer import Deferred


class DelayedRequestsMiddleware(object):
    def process_request(self, request, spider):
        # The delay is passed per request via its meta dict.
        delay_s = request.meta.get('delay_request_by', None)
        if not delay_s:
            return
        # Returning a Deferred makes Scrapy wait until it fires
        # before proceeding to download the request.
        deferred = Deferred()
        reactor.callLater(delay_s, deferred.callback, None)
        return deferred
```

Which you could later use in your Spider like this:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {'middlewares.DelayedRequestsMiddleware': 123},
    }

    def start_requests(self):
        # This request will have itself delayed by 5 seconds
        yield scrapy.Request(url='http://quotes.toscrape.com/page/1/',
                             meta={'delay_request_by': 5})
        # This request will not be delayed
        yield scrapy.Request(url='http://quotes.toscrape.com/page/2/')

    def parse(self, response):
        ...  # Process results here
```
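If you prefer not to repeat the middleware configuration in every spider, a minimal sketch (assuming a project package named `myproject`, which is not part of the original answer) is to enable it project-wide in `settings.py` instead of `custom_settings`:

```python
# settings.py
# Assumption: the project package is called 'myproject' and middlewares.py lives inside it.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.DelayedRequestsMiddleware': 123,
}
```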

A related method is described here: Method #2


  1. A more elaborate solution could be to set up a Kubernetes cluster with multiple replicas running. That way, the failure of a single container does not take down your entire scraping job.

  2. I don't think it's easy to configure a waiting time only for retries. You could play with DOWNLOAD_DELAY (though this affects the delay between all requests, not just retries), or raise RETRY_TIMES above its default of 2 (see the sketch after this list).
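For reference, both settings mentioned above can be set in `settings.py`. This is only a minimal sketch and the values are illustrative, not recommendations:

```python
# settings.py
# Wait 2 seconds between every request (applies to all requests, not only retries).
DOWNLOAD_DELAY = 2

# Retry failed requests up to 5 times instead of the default 2.
RETRY_TIMES = 5

# Optional: which HTTP codes trigger a retry (these mirror the defaults in recent Scrapy versions).
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
```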