Is Scrapy compatible with multiprocessing?

The recommended way to work with Scrapy is NOT to use multiprocessing inside running spiders.

A better alternative is to launch several Scrapy jobs, each with its own separate share of the input.
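As a rough sketch of that idea, the snippet below starts one `scrapy crawl` process per input file. The spider name `products`, the `input_file` spider argument and the file names are hypothetical; it assumes the `scrapy` command is on the PATH and that the spider reads its inputs from the argument passed with `-a`.

```python
import subprocess


def build_commands(spider_name, input_files):
    """Build one `scrapy crawl` command line per input file."""
    return [
        # -a NAME=VALUE passes a spider argument; `input_file` is a made-up name.
        ["scrapy", "crawl", spider_name, "-a", f"input_file={path}"]
        for path in input_files
    ]


def run_jobs(commands):
    """Start every job as its own OS process and wait for all of them."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]  # exit codes, in launch order


def main():
    # Hypothetical split of the input into two files.
    commands = build_commands("products", ["urls_part1.txt", "urls_part2.txt"])
    print(run_jobs(commands))


# To launch the jobs, call main() from inside the Scrapy project directory.
```

Each job gets its own interpreter and its own reactor, so the spiders never share state and nothing has to be restarted.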

Scrapy jobs are already very fast, IMO. Of course, you can always go faster by tuning settings such as the ones you mentioned (CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN, DOWNLOAD_DELAY, etc.). But the speed comes mainly from Scrapy being asynchronous: it does not wait for a request to complete before moving on to the remaining work (scheduling more requests, parsing responses, etc.).

CONCURRENT_REQUESTS has nothing to do with multiprocessing. Because Scrapy is asynchronous, it is mostly a way to cap how many requests can be in flight at the same time.

python selenium scrapy

You can use:

CONCURRENT_ITEMS to configure item-processing concurrency,
CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_PER_IP to configure HTTP request concurrency.
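For illustration, these settings could be placed in a project's settings.py roughly like this. The values are arbitrary examples rather than recommendations; tune them for the target site.

```python
# settings.py (illustrative values only)

# Maximum number of requests the downloader performs at the same time.
CONCURRENT_REQUESTS = 32

# Maximum number of simultaneous requests to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Maximum number of items processed in parallel in the item pipelines.
CONCURRENT_ITEMS = 100

# Seconds to wait between consecutive requests to the same site.
DOWNLOAD_DELAY = 0.25
```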

If you need more than that, or you have some heavy processing to do, I suggest moving that part into a separate process.

Scrapy's responsibility is web parsing. For example, an item pipeline could send tasks to a queue, and a separate process could consume and process them.
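A minimal sketch of such a pipeline is shown below, assuming the classic `open_spider` / `process_item` / `close_spider` pipeline hooks. `heavy_work`, `worker` and `QueuePipeline` are hypothetical names, and `heavy_work` only stands in for whatever expensive processing you actually have.

```python
import multiprocessing as mp


def heavy_work(item):
    """Placeholder for the expensive processing (hypothetical)."""
    return {**item, "title_length": len(item.get("title", ""))}


def worker(task_queue):
    """Runs in a separate process and consumes items until it sees None."""
    while True:
        item = task_queue.get()
        if item is None:  # sentinel sent by close_spider
            break
        heavy_work(item)


class QueuePipeline:
    """Item pipeline that hands every item to a worker process."""

    def open_spider(self, spider):
        self.task_queue = mp.Queue()
        self.worker = mp.Process(target=worker, args=(self.task_queue,))
        self.worker.start()

    def process_item(self, item, spider):
        # Enqueue a plain dict copy and return at once, so the crawl is not blocked.
        self.task_queue.put(dict(item))
        return item

    def close_spider(self, spider):
        self.task_queue.put(None)  # tell the worker to stop
        self.worker.join()
```

The pipeline would still need to be enabled through ITEM_PIPELINES. In a larger setup the in-process queue could be swapped for an external one so the consumer can run independently of the crawl.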


Well, generally speaking, Scrapy doesn't support multiprocessing; see:

ReactorNotRestartable error in while loop with scrapy

Within a given process, once you call reactor.run() or process.start(), you cannot call them again, because the Twisted reactor cannot be restarted. The reactor stops once the script finishes executing.

But there is a workaround:

    from multiprocessing import Pool
    pool = Pool(processes=pool_size, maxtasksperchild=1)

maxtasksperchild is the number of tasks a worker process can complete before it will exit and be replaced with a fresh worker process.

Since maxtasksperchild is set to 1, each subprocess is killed after its task finishes and a new subprocess is created, so there is no need to restart the reactor.

But this causes memory pressure, so make sure you really need it. I think starting multiple separate processes is a better choice.

I am new to Scrapy, so if you have any better suggestions, please tell me.
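A rough sketch of how this could look end to end is given below. The spider, the URLs and the function names are placeholders, not part of the original answer; it assumes Scrapy is installed and that each crawl is small enough to run in its own short-lived process.

```python
from multiprocessing import Pool


def run_spider(start_url):
    """Run one crawl in the current worker process (one reactor per process)."""
    # Imported here so the reactor is only set up inside the worker.
    from scrapy import Spider
    from scrapy.crawler import CrawlerProcess

    class TitleSpider(Spider):  # hypothetical example spider
        name = "title"
        start_urls = [start_url]

        def parse(self, response):
            yield {"url": response.url, "title": response.css("title::text").get()}

    process = CrawlerProcess(settings={"LOG_LEVEL": "WARNING"})
    process.crawl(TitleSpider)
    process.start()  # blocks until the crawl is finished
    return start_url


def main():
    urls = ["https://example.com/", "https://example.org/"]  # placeholder URLs
    # maxtasksperchild=1: every worker exits after one task and is replaced.
    with Pool(processes=2, maxtasksperchild=1) as pool:
        # imap_unordered uses chunksize=1 by default, so one URL is one task;
        # map() may batch several URLs into a single task, which would try to
        # start the reactor twice in the same worker.
        for finished in pool.imap_unordered(run_spider, urls):
            print("finished", finished)


# Call main() under an `if __name__ == "__main__":` guard when running this.
```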
