Should I create a pipeline to save files with Scrapy?
Yes and no [1]. If you fetch a PDF, the whole response is held in memory, but as long as the PDFs are not big enough to fill up your available memory, that is fine.
You could save the pdf in the spider callback:
    def parse_listing(self, response):
        # ... extract pdf urls
        for url in pdf_urls:
            yield Request(url, callback=self.save_pdf)

    def save_pdf(self, response):
        path = self.get_path(response.url)
        with open(path, "wb") as f:
            f.write(response.body)
If you choose to do it in a pipeline:
    # in the spider
    def parse_pdf(self, response):
        i = MyItem()
        i['body'] = response.body
        i['url'] = response.url
        # you can add more metadata to the item
        return i

    # in your pipeline
    def process_item(self, item, spider):
        path = self.get_path(item['url'])
        with open(path, "wb") as f:
            f.write(item['body'])
        # remove body and add path as reference
        del item['body']
        item['path'] = path
        # let item be processed by other pipelines. ie. db store
        return item
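Both snippets above call self.get_path, which the answer does not show. A minimal sketch of such a helper is given here; the name get_path comes from the answer, but the base_dir parameter, the hash prefix, and the fallback file name are assumptions made for illustration, not part of Scrapy. It is written as a plain function, so a method version would simply delegate to it.

```python
import hashlib
import os
from urllib.parse import urlparse


def get_path(url, base_dir="pdfs"):
    """Map a URL to a local file path (hypothetical helper, not a Scrapy API)."""
    # Use the last path segment as a readable name, with a fallback
    # for URLs that end in a slash.
    name = os.path.basename(urlparse(url).path) or "download.pdf"
    # Prefix a short hash of the full URL so that equal file names from
    # different URLs do not overwrite each other.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return os.path.join(base_dir, f"{digest}_{name}")
```

The target directory has to exist before open(path, "wb") is called, for example via os.makedirs(base_dir, exist_ok=True).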
[1] Another approach is to store only the PDFs' URLs and use a separate process to fetch the documents without buffering them in memory (e.g. wget).
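If you would rather stay in Python than shell out to wget, the same idea (copy the response to disk in chunks instead of holding the whole body in memory) can be sketched with the standard library alone. The function name download_to_disk and the 64 KiB chunk size are assumptions for illustration; this runs outside Scrapy, as a separate script fed with the stored URLs.

```python
import shutil
import urllib.request


def download_to_disk(url, path, chunk_size=64 * 1024):
    """Stream a URL to a local file without loading it fully into memory."""
    with urllib.request.urlopen(url) as resp, open(path, "wb") as out:
        # copyfileobj reads and writes chunk_size bytes at a time, so
        # memory use stays bounded regardless of the document size.
        shutil.copyfileobj(resp, out, chunk_size)
    return path
```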
There is a FilesPipeline that you can use directly, assuming you already have the file URL. This thread shows how to use it:
https://groups.google.com/forum/print/msg/scrapy-users/kzGHFjXywuY/O6PIhoT3thsJ
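In case the link is unavailable, here is a rough sketch of the usual setup. FilesPipeline is enabled in the project settings, and the spider yields items whose file_urls field lists the absolute URLs to download; the pipeline then records the results in a files field. The helper name build_file_item, the "downloads" directory, and the ".pdf" suffix filter are assumptions for illustration.

```python
from urllib.parse import urljoin

# settings.py (sketch): enable the pipeline and choose a download folder.
ITEM_PIPELINES = {"scrapy.pipelines.files.FilesPipeline": 1}
FILES_STORE = "downloads"  # assumed directory; any writable path works


def build_file_item(page_url, hrefs):
    """Build the item a spider callback would yield for FilesPipeline."""
    # Keep only links that look like PDFs (an assumption about the site).
    pdf_links = [h for h in hrefs if h.lower().endswith(".pdf")]
    # FilesPipeline reads absolute URLs from "file_urls" by default.
    return {"file_urls": [urljoin(page_url, h) for h in pdf_links]}
```

In a spider callback this could be used as yield build_file_item(response.url, response.css("a::attr(href)").getall()).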
It's a perfect tool for the job. The way Scrapy works is that you have spiders that transform web pages into structured data (items). Pipelines are postprocessors, but they use the same asynchronous infrastructure as spiders, so they are well suited to fetching media files.
In your case, you'd first extract the locations of the PDFs in the spider, fetch them in one pipeline, and have another pipeline save the items.
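The "another pipeline to save items" step might look like the sketch below. It assumes the fetching is done by FilesPipeline and stores each item as one JSON line; the class name JsonLinesStorePipeline, the items.jl file name, and the myproject.pipelines module path are hypothetical.

```python
import json


class JsonLinesStorePipeline:
    """Append each processed item to a JSON Lines file (hypothetical name)."""

    def __init__(self, path="items.jl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        self.file = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # With the ordering below, the download pipeline has already run,
        # so the item carries file references rather than file bodies.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item


# settings.py (sketch): lower numbers run first, so the download happens
# before the item is stored. "myproject.pipelines" is an assumed module path.
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
    "myproject.pipelines.JsonLinesStorePipeline": 300,
}
```

Keeping the two steps in separate pipelines means the storage step can later be swapped for a database writer without touching the download logic.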