Should I create a pipeline to save files with Scrapy?

python scrapy web-crawler pipeline

Yes and no [1]. A fetched PDF is held in memory as the response body, but as long as the PDFs are not large enough to exhaust your available memory, that is fine.

You could save the pdf in the spider callback:

    def parse_listing(self, response):
        # ... extract pdf urls
        for url in pdf_urls:
            yield Request(url, callback=self.save_pdf)

    def save_pdf(self, response):
        path = self.get_path(response.url)
        with open(path, "wb") as f:
            f.write(response.body)

If you choose to do it in a pipeline:

    # in the spider
    def parse_pdf(self, response):
        i = MyItem()
        i['body'] = response.body
        i['url'] = response.url
        # you can add more metadata to the item
        return i

    # in your pipeline
    def process_item(self, item, spider):
        path = self.get_path(item['url'])
        with open(path, "wb") as f:
            f.write(item['body'])
        # remove body and add path as reference
        del item['body']
        item['path'] = path
        # let item be processed by other pipelines, e.g. db store
        return item
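Both snippets call get_path, which is not defined in the answer. A minimal sketch of what such a helper might look like is shown here as a plain function; the directory name, the hash prefix, and the fallback file name are assumptions chosen for illustration, not part of the original answer.

```python
import hashlib
import os
from urllib.parse import urlparse

# Hypothetical download directory; adjust to your project.
DOWNLOAD_DIR = "pdfs"


def get_path(url, base_dir=DOWNLOAD_DIR):
    """Map a URL to a local file path inside base_dir."""
    # Use the last path segment as the file name, with a fallback
    # for URLs that end in "/".
    name = os.path.basename(urlparse(url).path) or "document.pdf"
    # Prefix with a short hash of the full URL so two different URLs
    # that share a file name do not overwrite each other.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:10]
    os.makedirs(base_dir, exist_ok=True)
    return os.path.join(base_dir, f"{digest}_{name}")
```

Inside a spider or pipeline class it could be wrapped as a method that delegates to this function.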

[1] Another approach is to store only the PDFs' URLs and use a separate process (e.g. wget) to fetch the documents without buffering them in memory.
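As a rough illustration of the footnote's idea, the sketch below is a standalone script (outside Scrapy) that streams a URL to disk in chunks rather than reading the whole body at once. It uses only the standard library; the function names, chunk size, and fallback file name are assumptions for the example.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def filename_from_url(url: str) -> str:
    """Derive a local file name from the last path segment of the URL."""
    name = Path(urlparse(url).path).name
    return name or "download.bin"


def stream_to_disk(url: str, dest_dir: str, chunk_size: int = 64 * 1024) -> Path:
    """Copy the response to disk chunk by chunk instead of loading it whole."""
    target = Path(dest_dir) / filename_from_url(url)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        # copyfileobj reads at most chunk_size bytes at a time.
        shutil.copyfileobj(response, out, chunk_size)
    return target
```

The spider would then only need to export the URLs (for example to a text file), and a script like this, or wget itself, could work through that list.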

There is a built-in FilesPipeline that you can use directly, assuming you already have the file URL. The link below shows how to use it:

https://groups.google.com/forum/print/msg/scrapy-users/kzGHFjXywuY/O6PIhoT3thsJ
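For orientation, here is a minimal sketch of the FilesPipeline setup. It assumes Scrapy 1.0 or later, where the class lives at scrapy.pipelines.files.FilesPipeline (older releases used a scrapy.contrib path). The pipeline reads URLs from an item's file_urls field and records the results in a files field. The helper name build_pdf_item, the "source" field, and the FILES_STORE value are hypothetical.

```python
from urllib.parse import urljoin

# These two settings would normally live in the project's settings.py.
ITEM_PIPELINES = {"scrapy.pipelines.files.FilesPipeline": 1}
FILES_STORE = "downloads"  # placeholder directory for the downloaded files


def build_pdf_item(page_url, hrefs):
    """Return an item whose file_urls field lists absolute PDF URLs."""
    pdf_urls = [
        urljoin(page_url, href)
        for href in hrefs
        if href.lower().endswith(".pdf")
    ]
    # "source" is extra metadata for illustration; FilesPipeline ignores it.
    return {"file_urls": pdf_urls, "source": page_url}


# In a spider callback this might be used as:
#     def parse_listing(self, response):
#         hrefs = response.css("a::attr(href)").getall()
#         yield build_pdf_item(response.url, hrefs)
```

With this in place the spider never touches the file bodies itself; the pipeline schedules the downloads and writes them under FILES_STORE.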

It's a perfect tool for the job. Scrapy works by having spiders transform web pages into structured data (items). Pipelines are postprocessors, but they use the same asynchronous infrastructure as spiders, so they are well suited to fetching media files.

In your case, you'd first extract the locations of the PDFs in the spider, fetch them in a pipeline, and use another pipeline to save the items.
