Scrapy Crawl URLs in Order


Scrapy Request has a priority attribute now.

If you have many Requests in a function and want to process a particular request first, you can set:

def parse(self, response):
    url = 'http://www.example.com/first'
    yield Request(url=url, callback=self.parse_data, priority=1)

    url = 'http://www.example.com/second'
    yield Request(url=url, callback=self.parse_data)

Scrapy will process the one with priority=1 first.
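The same idea scales to a whole list of urls that must be crawled in order: give earlier urls a higher priority. A minimal sketch, assuming the current scrapy.Spider API (the urls and the parse_data callback are placeholders):

from scrapy import Request, Spider

class OrderedSpider(Spider):
    name = "ordered"

    def start_requests(self):
        urls = [
            "http://www.example.com/first",
            "http://www.example.com/second",
            "http://www.example.com/third",
        ]
        # Higher priority is dequeued first, so count down along the list.
        for i, url in enumerate(urls):
            yield Request(url, callback=self.parse_data, priority=len(urls) - i)

    def parse_data(self, response):
        self.logger.info("got %s", response.url)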


start_urls defines the urls which are used in the start_requests method. Your parse method is called with a response for each start url once its page is downloaded. But you cannot control the download times - the first start url might be the last to arrive at parse.

One solution -- override the start_requests method and add a meta with a priority key to the generated requests. In parse, extract this priority value and add it to the item. In the pipeline, do something based on this value. (I don't know why and where you would need these urls to be processed in this particular order.)
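A rough sketch of that pipeline idea, assuming the current Scrapy API; the "order" meta key, MySpider, and OrderingPipeline are illustrative names, and the pipeline simply buffers items and sorts them when the spider closes:

from scrapy import Request, Spider

class MySpider(Spider):
    name = "meta_order"

    def start_requests(self):
        urls = [
            "http://www.example.com/page1",
            "http://www.example.com/page2",
        ]
        for i, url in enumerate(urls):
            # Carry each url's position through to the callback.
            yield Request(url, meta={"order": i})

    def parse(self, response):
        # Copy the order value from the request meta into the item.
        yield {"order": response.meta["order"], "url": response.url}

class OrderingPipeline:
    """Buffer items, then handle them in their original order."""

    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(item)
        return item

    def close_spider(self, spider):
        for item in sorted(self.items, key=lambda it: it["order"]):
            spider.logger.info("ordered item: %r", item)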

Or make it kind of synchronous -- store these start urls somewhere, and put only the first of them in start_urls. In parse, process the first response and yield the item(s), then take the next url from your storage and make a request for it with parse as the callback.
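A sketch of that chained approach, again with placeholder urls; each request is only issued after the previous response has been fully parsed:

from scrapy import Request, Spider

class ChainedSpider(Spider):
    name = "chained"

    def start_requests(self):
        # The urls to visit, in the exact order they should be crawled.
        self.pending = [
            "http://www.example.com/page1",
            "http://www.example.com/page2",
            "http://www.example.com/page3",
        ]
        yield Request(self.pending.pop(0))

    def parse(self, response):
        # Process the current page and yield its item(s) first...
        yield {"url": response.url}
        # ...then request the next url, with parse as the callback again.
        if self.pending:
            yield Request(self.pending.pop(0), callback=self.parse)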


The Google group discussion suggests using the priority attribute on the Request object. Scrapy guarantees that urls are crawled in DFO by default, but it does not ensure that urls are visited in the order they were yielded within your parse callback.
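Note that even correctly prioritized requests can be downloaded concurrently, so responses may still arrive out of order. If strict ordering matters more than speed, one option (an assumption beyond what the discussion suggests) is to make the crawl effectively serial via settings:

# settings.py
# Only one request in flight at a time, so the scheduling order
# is also the download order (this sacrifices all concurrency).
CONCURRENT_REQUESTS = 1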

Instead of yielding Request objects one at a time, you want to return a list of Requests from which requests will be popped until it is empty.

Can you try something like this?

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from mlbodds.items import MlboddsItem

class MLBoddsSpider(BaseSpider):
    name = "sbrforum.com"
    allowed_domains = ["sbrforum.com"]

    def start_requests(self):
        start_urls = reversed([
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/",
        ])
        return [Request(url=start_url) for start_url in start_urls]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
        items = []
        for site in sites:
            item = MlboddsItem()
            item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()
            # | /*//table[position()<2]//tr//th[@colspan="2"]//text()').extract()
            item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c4"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c6"]//text()').extract()
            items.append(item)
        return items
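The reversed() call matters here: since Scrapy dequeues in DFO (last-in, first-out) by default, handing over the urls in reverse presumably gets them crawled in their original order.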