Does a multithreaded crawler in Python really speed things up?

python multithreading gil

The GIL is not held by the Python interpreter when doing network operations. If you are doing work that is network-bound (like a crawler), you can safely ignore the effects of the GIL.

On the other hand, you may want to measure your performance if you create lots of threads doing processing (after downloading). Limiting the number of threads there will reduce the effects of the GIL on your performance.

python multithreading gil

Look at how scrapy works. It can help you a lot. It doesn't use threads, but can do multiple "simultaneous" downloading, all in the same thread.

If you think about it, you have only a single network card, so parallel processing can't really help by definition.

What scrapy does is just not wait around for the response of one request before sending another. All in a single thread.

python multithreading gil

When it comes to crawling you might be better off using something event-based such as Twisted that uses non-blocking asynchronous socket operations to fetch and return data as it comes, rather than blocking on each one.

Asynchronous network operations can easily be and usually are single-threaded. Network I/O almost always has higher latency than that of CPU because you really have no idea how long a page is going to take to return, and this is where async shines because an async operation is much lighter weight than a thread.

Edit: Here is a simple example of how to use Twisted's getPage to create a simple web crawler.

CodeHunter

Does a multithreaded crawler in Python really speed things up?

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last