How to create custom Scrapy Item Exporter?


It is true that the Scrapy documentation does not clearly state where to place an Item Exporter. To use an Item Exporter, these are the steps to follow.

  1. Choose an Item Exporter class and import it into pipelines.py in the project directory. It can be a predefined Item Exporter (e.g. XmlItemExporter) or a user-defined one (like the FanItemExporter defined in the question).
  2. Create an Item Pipeline class in pipelines.py and instantiate the imported Item Exporter in it. The details are explained later in this answer.
  3. Register this pipeline class in settings.py.

A detailed explanation of each step follows. The solution to the question is worked into each step.

Step 1

  • If you are using a predefined Item Exporter class, import it from the scrapy.exporters module.
    Example: from scrapy.exporters import XmlItemExporter

  • If you need a custom exporter, define a custom class in its own file. I suggest naming the file exporters.py and placing it in the project folder (where settings.py and items.py reside).

    A custom exporter normally subclasses BaseItemExporter, which is the right base class when the functionality has to change entirely. In this question, however, most of the required functionality is already provided by JsonItemExporter.

Hence, I am attaching two versions of the same exporter. One version extends the BaseItemExporter class and the other extends the JsonItemExporter class.

Version 1: Extending BaseItemExporter

Since BaseItemExporter is the parent class, start_exporting(), finish_exporting() and export_item() must be overridden to suit our needs.

    from scrapy.exporters import BaseItemExporter
    from scrapy.utils.serialize import ScrapyJSONEncoder
    from scrapy.utils.python import to_bytes


    class FanItemExporter(BaseItemExporter):

        def __init__(self, file, **kwargs):
            # Consume the standard exporter options; the rest go to the encoder
            self._configure(kwargs, dont_fail=True)
            self.file = file
            self.encoder = ScrapyJSONEncoder(**kwargs)
            self.first_item = True

        def start_exporting(self):
            # Double quotes around the key keep the output valid JSON
            self.file.write(b'{"product": [')

        def finish_exporting(self):
            self.file.write(b'\n]}')

        def export_item(self, item):
            # Write a separator before every item except the first
            if self.first_item:
                self.first_item = False
            else:
                self.file.write(b',\n')
            itemdict = dict(self._get_serialized_fields(item))
            self.file.write(to_bytes(self.encoder.encode(itemdict)))

Version 2: Extending JsonItemExporter

JsonItemExporter already provides the same export_item() implementation as Version 1. Therefore only the start_exporting() and finish_exporting() methods are overridden.

The implementation of JsonItemExporter can be seen in python_dir\pkgs\scrapy-1.1.0-py35_0\Lib\site-packages\scrapy\exporters.py

    from scrapy.exporters import JsonItemExporter


    class FanItemExporter(JsonItemExporter):

        def __init__(self, file, **kwargs):
            # Initialize the object using JsonItemExporter's constructor,
            # passing any exporter options through
            super().__init__(file, **kwargs)

        def start_exporting(self):
            # Double quotes around the key keep the output valid JSON
            self.file.write(b'{"product": [')

        def finish_exporting(self):
            self.file.write(b'\n]}')
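To check what the three hooks produce without running a crawl, they can be driven by hand against an in-memory buffer. The following is only a minimal sketch: SketchExporter and export_all are hypothetical stand-ins written with the standard library (they are not Scrapy classes), and the sketch assumes the wrapper key is written with double quotes, since JSON parsers reject single-quoted keys.

```python
import io
import json


class SketchExporter:
    """Stdlib-only stand-in that mirrors the exporter hooks (not Scrapy's API)."""

    def __init__(self, file):
        self.file = file          # must be a binary file-like object
        self.first_item = True

    def start_exporting(self):
        self.file.write(b'{"product": [')

    def export_item(self, item):
        # Separate items with a comma, but not before the first one
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(b',\n')
        self.file.write(json.dumps(item).encode('utf-8'))

    def finish_exporting(self):
        self.file.write(b'\n]}')


def export_all(items):
    """Run the full start / export / finish sequence and return the bytes."""
    buffer = io.BytesIO()
    exporter = SketchExporter(buffer)
    exporter.start_exporting()
    for item in items:
        exporter.export_item(item)
    exporter.finish_exporting()
    return buffer.getvalue()


output = export_all([{"name": "fan A"}, {"name": "fan B"}])
parsed = json.loads(output)   # succeeds because the key uses double quotes

# The same wrapper with a single-quoted key is not valid JSON
try:
    json.loads(b"{'product': []}")
    single_quotes_ok = True
except json.JSONDecodeError:
    single_quotes_ok = False
```

Driving the real FanItemExporter works the same way: pass it an io.BytesIO (or a file opened with 'wb'), call the three methods in this order, and load the result with json.loads.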

Note: The standard Item Exporter classes expect binary files, so the output file must be opened in binary mode (b). For the same reason, the write() calls in both versions write bytes to the file.
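The difference is easy to see with a binary buffer. This is a small stdlib-only sketch, using io.BytesIO as a stand-in for a file opened with mode 'wb'; the variable names are mine.

```python
import io

binary_file = io.BytesIO()            # behaves like open(path, 'wb')
binary_file.write(b'{"product": [')   # bytes are accepted

try:
    binary_file.write('{"product": [')   # str is rejected by a binary file
    str_write_failed = False
except TypeError:
    str_write_failed = True

# Text produced elsewhere has to be encoded before it is written
binary_file.write('\n]}'.encode('utf-8'))
```

This is why the exporters above use b'...' literals and convert the encoded item with to_bytes() before calling write().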

Step 2

Creating an Item Pipeline class.

    from project_name.exporters import FanItemExporter


    class FanExportPipeline(object):

        def __init__(self, file_name):
            # Storing the output filename
            self.file_name = file_name
            # File handle, opened later in open_spider()
            self.file_handle = None

        @classmethod
        def from_crawler(cls, crawler):
            # Getting the value of the FILE_NAME field from settings.py
            output_file_name = crawler.settings.get('FILE_NAME')

            # cls() calls FanExportPipeline's constructor
            # and returns a FanExportPipeline object
            return cls(output_file_name)

        def open_spider(self, spider):
            print('Custom export opened')

            # Opening the file in binary-write mode
            file = open(self.file_name, 'wb')
            self.file_handle = file

            # Creating a FanItemExporter object and starting the export
            self.exporter = FanItemExporter(file)
            self.exporter.start_exporting()

        def close_spider(self, spider):
            print('Custom Exporter closed')

            # Ending the export to file from the FanItemExporter object
            self.exporter.finish_exporting()

            # Closing the opened output file
            self.file_handle.close()

        def process_item(self, item, spider):
            # Passing the item to the FanItemExporter object for exporting to file
            self.exporter.export_item(item)
            return item
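Scrapy calls these hooks in a fixed order: from_crawler() once, open_spider() once, process_item() for every scraped item, and close_spider() once at the end. The sketch below walks through that sequence by hand. It is an illustration only: SketchExportPipeline is a hypothetical stdlib-only stand-in with the same hook names, and fake_crawler and fake_spider are placeholders for the objects Scrapy would pass in.

```python
import json
import os
import tempfile
from types import SimpleNamespace


class SketchExportPipeline:
    """Same hook names as FanExportPipeline; writes JSON with the stdlib only."""

    def __init__(self, file_name):
        self.file_name = file_name
        self.file_handle = None
        self.first_item = True

    @classmethod
    def from_crawler(cls, crawler):
        # Read the output path from the (fake) settings object
        return cls(crawler.settings.get('FILE_NAME'))

    def open_spider(self, spider):
        self.file_handle = open(self.file_name, 'wb')
        self.file_handle.write(b'{"product": [')

    def process_item(self, item, spider):
        if not self.first_item:
            self.file_handle.write(b',\n')
        self.first_item = False
        self.file_handle.write(json.dumps(item).encode('utf-8'))
        return item   # hand the item on to any later pipeline

    def close_spider(self, spider):
        self.file_handle.write(b'\n]}')
        self.file_handle.close()


# Hypothetical stand-ins for the objects Scrapy would supply
out_path = os.path.join(tempfile.mkdtemp(), 'products.json')
fake_crawler = SimpleNamespace(settings={'FILE_NAME': out_path})
fake_spider = None

pipeline = SketchExportPipeline.from_crawler(fake_crawler)
pipeline.open_spider(fake_spider)
returned = [pipeline.process_item(item, fake_spider)
            for item in ({"id": 1}, {"id": 2})]
pipeline.close_spider(fake_spider)

with open(out_path, 'rb') as fh:
    result = json.loads(fh.read())
```

Note that process_item() returns the item. If it returned nothing, any pipeline registered after this one would not receive the item.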

Step 3

Now that the Item Export Pipeline is defined, register it in settings.py. Also add a FILE_NAME field to settings.py. This field contains the filename of the output file.

Add the following lines to settings.py.

    FILE_NAME = 'path/outputfile.ext'

    ITEM_PIPELINES = {
        'project_name.pipelines.FanExportPipeline': 600,
    }

If ITEM_PIPELINES is already uncommented, then add the following line to the ITEM_PIPELINES dictionary.

'project_name.pipelines.FanExportPipeline' : 600,
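The number is an ordering priority: items pass through the registered pipelines from the lowest value to the highest, and values are customarily kept in the 0-1000 range. As a rough illustration, assuming a second, hypothetical CleanupPipeline is registered at 300, the order can be worked out like this:

```python
ITEM_PIPELINES = {
    'project_name.pipelines.FanExportPipeline': 600,
    'project_name.pipelines.CleanupPipeline': 300,   # hypothetical second pipeline
}

# Lower values run first, so sort the entries by their priority
run_order = [path for path, priority
             in sorted(ITEM_PIPELINES.items(), key=lambda entry: entry[1])]
```

With these values the export pipeline runs last, so it writes the items as they look after any earlier pipeline has processed them.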

This is one way to create a custom Item Export pipeline.