How can I use different pipelines for different spiders in a single Scrapy project?
Just remove all pipelines from the main settings and use this inside the spider.
This will define the pipeline to use per spider:
class testSpider(InitSpider):
    name = 'test'

    custom_settings = {
        'ITEM_PIPELINES': {
            'app.MyPipeline': 400
        }
    }
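Since custom_settings is defined per spider class, each spider can declare its own set of pipelines. A minimal sketch, assuming a second spider and a hypothetical app.OtherPipeline path:

class OtherSpider(InitSpider):
    name = 'other'

    custom_settings = {
        'ITEM_PIPELINES': {
            'app.OtherPipeline': 400  # hypothetical pipeline used only by this spider
        }
    }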
Building on the solution from Pablo Hoffman, you can use the following decorator on the process_item
method of a Pipeline object so that it checks the pipeline
attribute of your spider for whether or not it should be executed. For example:
import functools

from scrapy import log


def check_spider_pipeline(process_item_method):

    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):

        # message template for debugging
        msg = '%%s %s pipeline step' % (self.__class__.__name__,)

        # if class is in the spider's pipeline, then use the
        # process_item method normally.
        if self.__class__ in spider.pipeline:
            spider.log(msg % 'executing', level=log.DEBUG)
            return process_item_method(self, item, spider)

        # otherwise, just return the untouched item (skip this step in
        # the pipeline)
        else:
            spider.log(msg % 'skipping', level=log.DEBUG)
            return item

    return wrapper
For this decorator to work correctly, the spider must have a pipeline attribute with a container of the Pipeline objects that you want to use to process the item, for example:
class MySpider(BaseSpider):

    pipeline = set([
        pipelines.Save,
        pipelines.Validate,
    ])

    def parse(self, response):
        # insert scrapy goodness here
        return item
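Another spider in the same project can then pick a different subset; a minimal sketch, assuming a hypothetical spider that only wants the Validate step:

class ValidateOnlySpider(BaseSpider):

    # hypothetical spider that only runs the Validate pipeline step
    pipeline = set([
        pipelines.Validate,
    ])

    def parse(self, response):
        # insert scrapy goodness here
        return item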
And then in a pipelines.py file:
class Save(object):

    @check_spider_pipeline
    def process_item(self, item, spider):
        # do saving here
        return item


class Validate(object):

    @check_spider_pipeline
    def process_item(self, item, spider):
        # do validating here
        return item
All Pipeline objects should still be defined in ITEM_PIPELINES in settings (in the correct order -- it would be nice if the order could be specified on the Spider, too).
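For example, a minimal settings.py sketch, assuming the project is named myproject and the Save/Validate classes live in myproject/pipelines.py (using the dict form of ITEM_PIPELINES, where the priority numbers define the order):

ITEM_PIPELINES = {
    'myproject.pipelines.Validate': 100,  # lower number runs first
    'myproject.pipelines.Save': 200,
}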
The other solutions given here are good, but I think they could be slow, because we are not really using one pipeline per spider; instead, we are checking whether a pipeline applies every time an item is returned (and in some cases this could reach millions of items).
A good way to completely disable (or enable) a feature per spider is to use custom_settings and from_crawler for all extensions, like this:
pipelines.py
from scrapy.exceptions import NotConfigured


class SomePipeline(object):
    def __init__(self):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('SOMEPIPELINE_ENABLED'):
            # if this isn't specified in settings, the pipeline will be completely disabled
            raise NotConfigured
        return cls()

    def process_item(self, item, spider):
        # change my item
        return item
settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.SomePipeline': 300,
}
SOMEPIPELINE_ENABLED = True  # you could have the pipeline enabled by default
spider1.py
class Spider1(Spider):
    name = 'spider1'
    start_urls = ["http://example.com"]

    custom_settings = {
        'SOMEPIPELINE_ENABLED': False
    }
As you can see, we have specified custom_settings that will override the settings in settings.py, and we are disabling SOMEPIPELINE_ENABLED for this spider.
Now when you run this spider, check for something like:
[scrapy] INFO: Enabled item pipelines: []
Now scrapy has completely disabled the pipeline, without bothering about its existence for the whole run. Note that this also works for scrapy extensions and middlewares.
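For instance, a minimal sketch of the same pattern applied to a downloader middleware, assuming a hypothetical MYMIDDLEWARE_ENABLED setting and a middleware registered in DOWNLOADER_MIDDLEWARES:

from scrapy.exceptions import NotConfigured


class SomeDownloaderMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        # hypothetical setting; if missing or False, Scrapy drops this
        # middleware for the whole run, just like the pipeline above
        if not crawler.settings.getbool('MYMIDDLEWARE_ENABLED'):
            raise NotConfigured
        return cls()

    def process_request(self, request, spider):
        return None  # let the request continue through the chain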