
How can I use different pipelines for different spiders in a single Scrapy project?


Just remove all pipelines from the main settings and use this inside the spider.

This will define the pipeline to use per spider:

class testSpider(InitSpider):
    name = 'test'
    custom_settings = {
        'ITEM_PIPELINES': {
            'app.MyPipeline': 400
        }
    }
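
For completeness, a minimal sketch of what the referenced pipeline class might look like (the dotted path 'app.MyPipeline' is taken from the snippet above; the class body is an assumption):

class MyPipeline(object):
    def process_item(self, item, spider):
        # hypothetical per-item logic -- replace with your own processing
        return item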


Building on the solution from Pablo Hoffman, you can use the following decorator on the process_item method of a Pipeline object so that it checks the pipeline attribute of your spider for whether or not it should be executed. For example:

import functools

from scrapy import log  # legacy Scrapy logging module, used for log.DEBUG below


def check_spider_pipeline(process_item_method):

    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):

        # message template for debugging
        msg = '%%s %s pipeline step' % (self.__class__.__name__,)

        # if class is in the spider's pipeline, then use the
        # process_item method normally.
        if self.__class__ in spider.pipeline:
            spider.log(msg % 'executing', level=log.DEBUG)
            return process_item_method(self, item, spider)

        # otherwise, just return the untouched item (skip this step in
        # the pipeline)
        else:
            spider.log(msg % 'skipping', level=log.DEBUG)
            return item

    return wrapper

For this decorator to work correctly, the spider must have a pipeline attribute with a container of the Pipeline objects that you want to use to process the item, for example:

from scrapy.spider import BaseSpider  # legacy Scrapy import used by this answer

from myproject import pipelines  # adjust 'myproject' to your project package


class MySpider(BaseSpider):

    pipeline = set([
        pipelines.Save,
        pipelines.Validate,
    ])

    def parse(self, response):
        # insert scrapy goodness here
        return item

And then in a pipelines.py file:

class Save(object):

    @check_spider_pipeline
    def process_item(self, item, spider):
        # do saving here
        return item


class Validate(object):

    @check_spider_pipeline
    def process_item(self, item, spider):
        # do validating here
        return item

All Pipeline objects should still be defined in ITEM_PIPELINES in settings (in the correct order -- it would be nice if the order could be specified on the spider, too).
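
For example, the settings.py entry for this setup could look roughly like the following (the module path myproject.pipelines and the order values are assumptions; only their relative order matters):

ITEM_PIPELINES = {
    'myproject.pipelines.Validate': 300,
    'myproject.pipelines.Save': 400,
}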


The other solutions given here are good, but I think they could be slow, because we are not really using a pipeline per spider; instead, we are checking whether a pipeline should run every time an item is returned (and in some cases this could reach millions of items).

A good way to completely disable (or enable) a feature per spider is to use custom_settings and from_crawler, which works for all extensions, like this:

pipelines.py

from scrapy.exceptions import NotConfigured


class SomePipeline(object):
    def __init__(self):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('SOMEPIPELINE_ENABLED'):
            # if this isn't specified in settings, the pipeline will be completely disabled
            raise NotConfigured
        return cls()

    def process_item(self, item, spider):
        # change my item
        return item

settings.py

ITEM_PIPELINES = {
    'myproject.pipelines.SomePipeline': 300,
}

SOMEPIPELINE_ENABLED = True  # you could have the pipeline enabled by default

spider1.py

from scrapy import Spider


class Spider1(Spider):
    name = 'spider1'
    start_urls = ["http://example.com"]
    custom_settings = {
        'SOMEPIPELINE_ENABLED': False
    }

As you can see, we have specified custom_settings, which will override anything set in settings.py, and we are disabling SOMEPIPELINE_ENABLED for this spider.
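
By contrast, a spider that should keep the pipeline enabled simply omits the override; a sketch (spider name and URL are illustrative):

class Spider2(Spider):
    name = 'spider2'
    start_urls = ["http://example.com"]
    # no 'SOMEPIPELINE_ENABLED' override here, so the default from settings.py (True) applies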

Now when you run this spider, check the log for something like:

[scrapy] INFO: Enabled item pipelines: []
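
(For example, run scrapy crawl spider1 from the project directory; with the pipeline left enabled, the same log line would list it instead of being empty.)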

Now Scrapy has completely disabled the pipeline, without even instantiating it for the whole run. Note that this also works for Scrapy extensions and middlewares.