
Should I create pipeline to save files with scrapy?


Yes and no[1]. If you fetch a pdf it will be stored in memory, but as long as the pdfs are not big enough to fill up your available memory, that is ok.

You could save the pdf in the spider callback:

from scrapy.http import Request

def parse_listing(self, response):
    # ... extract pdf urls
    for url in pdf_urls:
        yield Request(url, callback=self.save_pdf)

def save_pdf(self, response):
    path = self.get_path(response.url)
    with open(path, "wb") as f:
        f.write(response.body)
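The snippet above calls a get_path helper that isn't defined in the answer; a minimal sketch of one (the name, the base_dir default, and the fallback filename are all assumptions) could look like:

```python
import os
from urllib.parse import urlparse

def get_path(url, base_dir="downloads"):
    """Map a pdf url to a local file path (hypothetical helper)."""
    # use the last path segment of the url as the filename
    name = os.path.basename(urlparse(url).path) or "index.pdf"
    os.makedirs(base_dir, exist_ok=True)
    return os.path.join(base_dir, name)
```

In a spider it would be a method (self.get_path), but the logic is the same.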

If you choose to do it in a pipeline:

# in the spider
def parse_pdf(self, response):
    i = MyItem()
    i['body'] = response.body
    i['url'] = response.url
    # you can add more metadata to the item
    return i

# in your pipeline
def process_item(self, item, spider):
    path = self.get_path(item['url'])
    with open(path, "wb") as f:
        f.write(item['body'])
    # remove body and add path as reference
    del item['body']
    item['path'] = path
    # let item be processed by other pipelines. ie. db store
    return item
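For the pipeline variant to run, it also has to be registered in the project settings; a sketch, where the module path and the priority number are hypothetical:

```python
# settings.py -- register the custom pipeline
# (the dotted path and the priority value are placeholders)
ITEM_PIPELINES = {
    'myproject.pipelines.SavePdfPipeline': 300,
}
```

Lower numbers run earlier, so a pipeline that saves the body should come before one that, say, stores the item in a database.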

[1] Another approach could be to store only the pdfs' urls and use another process to fetch the documents without buffering them in memory (e.g. wget).
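That footnote approach can be sketched like this: the spider only collects urls, you dump them one per line to a text file, and an external tool does the downloading (the filename and the urls here are made up):

```python
# urls collected by the spider instead of the pdf bodies
pdf_urls = [
    "http://example.com/a.pdf",
    "http://example.com/b.pdf",
]

# write one url per line, then fetch outside Scrapy with e.g.:
#   wget --input-file=pdf_urls.txt
with open("pdf_urls.txt", "w") as f:
    f.write("\n".join(pdf_urls) + "\n")
```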


There is a FilesPipeline that you can use directly, assuming you already have the file urls. This link shows how to use FilesPipeline:

https://groups.google.com/forum/print/msg/scrapy-users/kzGHFjXywuY/O6PIhoT3thsJ
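For reference, enabling the stock FilesPipeline in recent Scrapy versions is mostly configuration (the store path below is a placeholder). Your items then need a file_urls field containing the urls to download, and the pipeline fills in a files field with the download results:

```python
# settings.py -- enable Scrapy's built-in FilesPipeline
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = '/path/to/store'  # where downloaded files are kept
```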


It's a perfect tool for the job. The way Scrapy works is that you have spiders that transform web pages into structured data (items). Pipelines are postprocessors, but they use the same asynchronous infrastructure as spiders, so they are perfect for fetching media files.

In your case, you'd first extract the locations of the PDFs in the spider, fetch them in one pipeline, and have another pipeline save the items.
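The two-pipeline flow can be illustrated in plain Python (class and field names are hypothetical, and the "fetch" step is faked; in real Scrapy each enabled pipeline's process_item is called in ascending priority order):

```python
# Sketch of an item flowing through two pipelines in order.
class FetchPdfPipeline:
    def process_item(self, item, spider):
        # a real implementation would download item['url'] here;
        # we fake the bytes for illustration
        item['body'] = b'%PDF-1.4 fake bytes'
        return item

class StoreItemPipeline:
    def process_item(self, item, spider):
        # persist the bytes, keep only a path reference on the item
        item['path'] = '/tmp/' + item['url'].rsplit('/', 1)[-1]
        del item['body']
        return item

item = {'url': 'http://example.com/report.pdf'}
for pipeline in (FetchPdfPipeline(), StoreItemPipeline()):
    item = pipeline.process_item(item, spider=None)
# item now carries 'url' and 'path'; the body has been dropped
```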