Scrapy - set delay to retry middleware
One way would be to add a downloader middleware to your Spider (source, linked):
```python
# File: middlewares.py
from twisted.internet import reactor
from twisted.internet.defer import Deferred


class DelayedRequestsMiddleware(object):
    def process_request(self, request, spider):
        delay_s = request.meta.get('delay_request_by', None)
        if not delay_s:
            return
        # Return a Deferred that fires after delay_s seconds; Scrapy waits
        # for it before continuing to download the request.
        deferred = Deferred()
        reactor.callLater(delay_s, deferred.callback, None)
        return deferred
```
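Returning a Deferred from process_request is what makes this work: Scrapy chains downloader-middleware results through Twisted Deferreds, so the request is held back until reactor.callLater fires the callback delay_s seconds later.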
You could then use it in your Spider like this:
```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {'middlewares.DelayedRequestsMiddleware': 123},
    }

    def start_requests(self):
        # This request will be delayed by 5 seconds
        yield scrapy.Request(url='http://quotes.toscrape.com/page/1/',
                             meta={'delay_request_by': 5})
        # This request will not be delayed
        yield scrapy.Request(url='http://quotes.toscrape.com/page/2/')

    def parse(self, response):
        ...  # Process results here
```
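Since the goal here is to delay retries specifically, you can combine this with Scrapy's built-in RetryMiddleware so that only retried requests carry the delay_request_by meta key. A minimal sketch follows; the DelayedRetryMiddleware name and the RETRY_DELAY_S setting are made up for illustration, and it overrides the private _retry method, so verify it against your Scrapy version:

```python
# File: middlewares.py (sketch; _retry is a private Scrapy API)
from scrapy.downloadermiddlewares.retry import RetryMiddleware


class DelayedRetryMiddleware(RetryMiddleware):
    def _retry(self, request, reason, spider):
        retry_request = super()._retry(request, reason, spider)
        if retry_request is not None:
            # Tag the retried request so DelayedRequestsMiddleware pauses it
            retry_request.meta['delay_request_by'] = spider.settings.getint(
                'RETRY_DELAY_S', 5)
        return retry_request
```

To enable it, disable the stock middleware and register the subclass at the same priority in DOWNLOADER_MIDDLEWARES, e.g. {'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 'middlewares.DelayedRetryMiddleware': 550}.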
A related method is described here: Method #2
A more elaborate solution could be to set up a Kubernetes cluster running multiple replicas of your scraper; that way, the failure of a single container doesn't take down the whole scraping job.
I don't think it's easy to configure a waiting time only for retries. You could play with DOWNLOAD_DELAY (though that adds a delay between all requests, not just retries), or set RETRY_TIMES to a higher value than the default of 2; see the sketch below.
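For reference, a minimal sketch of those two settings in your project's settings.py (the values are illustrative, not recommendations):

```python
# File: settings.py
DOWNLOAD_DELAY = 3  # seconds to wait between *every* request, not only retries
RETRY_TIMES = 5     # maximum retries per failed request (Scrapy's default is 2)
```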