Scrapy, Python: Multiple Item Classes in one pipeline?


By default every item goes through every pipeline.

For instance, if you yield a ProfileItem and a CommentItem, they'll both go through all pipelines. If you have a pipeline set up to track item types, then your process_item method could look like:

```python
def process_item(self, item, spider):
    self.stats.inc_value('typecount/%s' % type(item).__name__)
    return item
```

When a ProfileItem comes through, 'typecount/ProfileItem' is incremented. When a CommentItem comes through, 'typecount/CommentItem' is incremented.
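For context, the `self.stats` used above is the crawler's stats collector, which a pipeline usually receives through the `from_crawler` hook. A minimal sketch of the full pipeline class (the name `TypeCountPipeline` is illustrative; `crawler.stats` and `inc_value` are the real Scrapy stats API):

```python
# Sketch of a type-counting pipeline. Scrapy calls from_crawler() when it
# builds the pipeline and passes the running crawler, whose .stats attribute
# is the stats collector with inc_value()/get_value().
class TypeCountPipeline:
    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # Hook Scrapy uses to construct the pipeline with crawler access.
        return cls(crawler.stats)

    def process_item(self, item, spider):
        # Count one occurrence of this item's class name.
        self.stats.inc_value('typecount/%s' % type(item).__name__)
        return item
```

The counts then show up in the crawl stats dump alongside Scrapy's built-in counters.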

You can also have one pipeline handle only one type of item, if handling that item type is unique, by checking the item type before proceeding:

```python
def process_item(self, item, spider):
    if not isinstance(item, ProfileItem):
        return item
    # Handle your ProfileItem here.
    return item
```

If you had the two process_item methods above set up in different pipelines, the item would go through both of them, being tracked and being processed (or ignored by the second one).

Additionally, you could have one pipeline set up to handle all 'related' items:

```python
def process_item(self, item, spider):
    if isinstance(item, ProfileItem):
        return self.handleProfile(item, spider)
    if isinstance(item, CommentItem):
        return self.handleComment(item, spider)

def handleComment(self, item, spider):
    # Handle Comment here, then return the item
    return item

def handleProfile(self, item, spider):
    # Handle Profile here, then return the item
    return item
```

Or, you could make it even more complex and develop a type delegation system that loads classes and calls default handler methods, similar to how Scrapy handles middleware/pipelines. It's really up to you how complex you need it, and what you want to do.
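Such a delegation system might look like the following sketch, where handlers are looked up by item class name with a pass-through default. All names here (`DelegatingPipeline`, the `handle_*` methods, the stand-in item classes) are illustrative, not Scrapy API:

```python
# Stand-ins for the real scrapy.Item subclasses, so the sketch is self-contained.
class ProfileItem:
    pass

class CommentItem:
    pass

class DelegatingPipeline:
    def process_item(self, item, spider):
        # Route to handle_<lowercased class name>, falling back to a default.
        handler = getattr(self, 'handle_%s' % type(item).__name__.lower(),
                          self.handle_default)
        return handler(item, spider)

    def handle_default(self, item, spider):
        # Unknown item types pass through untouched.
        return item

    def handle_profileitem(self, item, spider):
        item.handled_by = 'profile'  # placeholder for real Profile logic
        return item

    def handle_commentitem(self, item, spider):
        item.handled_by = 'comment'  # placeholder for real Comment logic
        return item
```

Adding support for a new item type is then just a matter of adding another `handle_*` method.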


Defining multiple Items is tricky when you are exporting your data and the items have a relation (Profile 1 -- N Comments, for instance) and you have to export them together, because each item is processed at a different time by the pipelines. An alternative approach for this scenario is to define a custom Scrapy Field, for example:

```python
class ProfileField(scrapy.item.Field):
    # your business here
    pass

class CommentItem(scrapy.Item):
    profile = ProfileField()
```

But given the scenario where you MUST have 2 items, it is highly suggested to use a different pipeline for each of these item types and also different exporter instances, so that you get this information in different files (if you are using files):

settings.py

```python
ITEM_PIPELINES = {
    'pipelines.CommentsPipeline': 1,
    'pipelines.ProfilePipeline': 1,
}
```

pipelines.py

```python
class CommentsPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, CommentItem):
            # Your business here
            pass
        return item

class ProfilePipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, ProfileItem):
            # Your business here
            pass
        return item
```
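The per-type export idea can be sketched as below. Here `json.dumps` stands in for Scrapy's JsonLinesItemExporter, `CommentItem` is a dict-based stand-in for the real scrapy.Item subclass, and the file name is illustrative:

```python
import json

class CommentItem(dict):
    # Stand-in for the real scrapy.Item subclass from items.py.
    pass

class CommentsPipeline(object):
    def open_spider(self, spider):
        # Scrapy calls open_spider/close_spider at the start/end of the crawl.
        self.file = open('comments.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Only this pipeline's item type is written to this file.
        if isinstance(item, CommentItem):
            self.file.write(json.dumps(dict(item)) + '\n')
        # Always return the item so later pipelines still receive it.
        return item
```

A ProfilePipeline would mirror this with its own exporter and its own output file.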


@Rejected's answer was the solution, but it needed some tweaks before it would work for me, so I'm sharing them here. This is my pipeline.py:

```python
from .items import MyFirstItem, MySecondItem  # needed import of Items

class MyPipeline(object):  # enclosing pipeline class (name illustrative)
    def process_item(self, item, spider):
        if isinstance(item, MyFirstItem):
            return self.handlefirstitem(item, spider)
        if isinstance(item, MySecondItem):
            return self.handleseconditem(item, spider)

    def handlefirstitem(self, item, spider):  # needed self added
        self.storemyfirst_db(item)  # function to pipe it to database table
        return item

    def handleseconditem(self, item, spider):  # needed self added
        self.storemysecond_db(item)  # function to pipe it to database table
        return item
```