Scrapy Crawl URLs in Order


Scrapy Request has a priority attribute now.

If you have many Requests in a function and want to process a particular request first, you can set:

def parse(self, response):
    url = 'http://www.example.com/first'
    yield Request(url=url, callback=self.parse_data, priority=1)

    url = 'http://www.example.com/second'
    yield Request(url=url, callback=self.parse_data)

Scrapy will process the one with priority=1 first.
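The same idea scales to a whole list of urls that must be crawled in order: give earlier urls a higher priority. A minimal sketch, assuming the current scrapy.Spider API (the urls and the parse_data callback are placeholders):

from scrapy import Request, Spider

class OrderedSpider(Spider):
    name = "ordered"

    def start_requests(self):
        urls = [
            "http://www.example.com/first",
            "http://www.example.com/second",
            "http://www.example.com/third",
        ]
        # Higher priority is dequeued first, so count down along the list.
        for i, url in enumerate(urls):
            yield Request(url, callback=self.parse_data, priority=len(urls) - i)

    def parse_data(self, response):
        self.logger.info("got %s", response.url)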


start_urls defines the urls which are used in the start_requests method. Your parse method is called with a response for each start url once its page is downloaded. But you cannot control the download times - the first start url might be the last to arrive at parse.

One solution -- override the start_requests method and add a meta with a priority key to the generated requests. In parse, extract this priority value and add it to the item. In the pipeline, do something based on this value. (I don't know why and where you would need these urls to be processed in this particular order.)
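A rough sketch of that pipeline idea, assuming the current Scrapy API; the "order" meta key, MySpider, and OrderingPipeline are illustrative names, and the pipeline simply buffers items and sorts them when the spider closes:

from scrapy import Request, Spider

class MySpider(Spider):
    name = "meta_order"

    def start_requests(self):
        urls = [
            "http://www.example.com/page1",
            "http://www.example.com/page2",
        ]
        for i, url in enumerate(urls):
            # Carry each url's position through to the callback.
            yield Request(url, meta={"order": i})

    def parse(self, response):
        # Copy the order value from the request meta into the item.
        yield {"order": response.meta["order"], "url": response.url}

class OrderingPipeline:
    """Buffer items, then handle them in their original order."""

    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(item)
        return item

    def close_spider(self, spider):
        for item in sorted(self.items, key=lambda it: it["order"]):
            spider.logger.info("ordered item: %r", item)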

Or make it kind of synchronous -- store these start urls somewhere, and put only the first of them in start_urls. In parse, process the first response and yield the item(s), then take the next url from your storage and make a request for it with parse as the callback.
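A sketch of that chained approach, again with placeholder urls; each request is only issued after the previous response has been fully parsed:

from scrapy import Request, Spider

class ChainedSpider(Spider):
    name = "chained"

    def start_requests(self):
        # The urls to visit, in the exact order they should be crawled.
        self.pending = [
            "http://www.example.com/page1",
            "http://www.example.com/page2",
            "http://www.example.com/page3",
        ]
        yield Request(self.pending.pop(0))

    def parse(self, response):
        # Process the current page and yield its item(s) first...
        yield {"url": response.url}
        # ...then request the next url, with parse as the callback again.
        if self.pending:
            yield Request(self.pending.pop(0), callback=self.parse)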


The Google group discussion suggests using the priority attribute on the Request object. Scrapy guarantees that urls are crawled in DFO by default, but it does not ensure that urls are visited in the order they were yielded within your parse callback.
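Note that even correctly prioritized requests can be downloaded concurrently, so responses may still arrive out of order. If strict ordering matters more than speed, one option (an assumption beyond what the discussion suggests) is to make the crawl effectively serial via settings:

# settings.py
# Only one request in flight at a time, so the scheduling order
# is also the download order (this sacrifices all concurrency).
CONCURRENT_REQUESTS = 1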

Instead of yielding Request objects one at a time, you want to return a list of Requests from which requests will be popped until it is empty.

Can you try something like this?

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from mlbodds.items import MlboddsItem

class MLBoddsSpider(BaseSpider):
    name = "sbrforum.com"
    allowed_domains = ["sbrforum.com"]

    def start_requests(self):
        start_urls = reversed([
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/",
        ])
        return [Request(url=start_url) for start_url in start_urls]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
        items = []
        for site in sites:
            item = MlboddsItem()
            item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()
            # | /*//table[position()<2]//tr//th[@colspan="2"]//text()').extract()
            item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c4"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c6"]//text()').extract()
            items.append(item)
        return items
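The reversed() call matters here: since Scrapy dequeues in DFO (last-in, first-out) by default, handing over the urls in reverse presumably gets them crawled in their original order.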