A very simple multithreaded parallel URL fetcher (without a queue)
Simplifying your original version as far as possible:
    import threading
    import urllib2
    import time

    start = time.time()
    urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com", "http://www.amazon.com", "http://www.facebook.com"]

    def fetch_url(url):
        urlHandler = urllib2.urlopen(url)
        html = urlHandler.read()
        print "'%s' fetched in %ss" % (url, (time.time() - start))

    threads = [threading.Thread(target=fetch_url, args=(url,)) for url in urls]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    print "Elapsed Time: %s" % (time.time() - start)
The only new tricks here are:

- Keep track of the threads you create.
- Don't bother with a counter of threads if you just want to know when they're all done; `join` already tells you that.
- If you don't need any state or external API, you don't need a `Thread` subclass, just a `target` function (see the sketch after this list).
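For comparison, here's a minimal sketch of the `Thread` subclass style that last bullet says you can skip; the `FetchThread` name is purely illustrative. It does the same work as the `target`-function version above, just with more boilerplate:

    import threading
    import urllib2

    # Hypothetical subclass equivalent of fetch_url above; only worth it
    # when each thread needs to carry its own state (here, the fetched html).
    class FetchThread(threading.Thread):
        def __init__(self, url):
            threading.Thread.__init__(self)
            self.url = url
            self.html = None

        def run(self):
            self.html = urllib2.urlopen(self.url).read()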
`multiprocessing` has a thread pool that doesn't start other processes:
    #!/usr/bin/env python
    from multiprocessing.pool import ThreadPool
    from time import time as timer
    from urllib2 import urlopen

    urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com", "http://www.amazon.com", "http://www.facebook.com"]

    def fetch_url(url):
        try:
            response = urlopen(url)
            return url, response.read(), None
        except Exception as e:
            return url, None, e

    start = timer()
    results = ThreadPool(20).imap_unordered(fetch_url, urls)
    for url, html, error in results:
        if error is None:
            print("%r fetched in %ss" % (url, timer() - start))
        else:
            print("error fetching %r: %s" % (url, error))
    print("Elapsed Time: %s" % (timer() - start,))
The advantages compared to the `Thread`-based solution:

- `ThreadPool` lets you limit the maximum number of concurrent connections (`20` in the code example)
- the output is not garbled because all output is in the main thread
- errors are logged
- the code works on both Python 2 and 3 without changes (assuming `from urllib.request import urlopen` on Python 3; a sketch of that shim follows below).
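That last point can be made concrete with a small import shim; this is a sketch of one common way to do it, not part of the original code:

    # Try the Python 2 module first, fall back to the Python 3 location,
    # so the rest of the script can call urlopen() unchanged.
    try:
        from urllib2 import urlopen  # Python 2
    except ImportError:
        from urllib.request import urlopen  # Python 3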
The main example in the `concurrent.futures` docs does everything you want, a lot more simply. Plus, it can handle huge numbers of URLs by only doing 5 at a time, and it handles errors much more nicely.

Of course this module is only built in with Python 3.2 or later… but if you're using 2.5-3.1, you can just install the backport, `futures`, off PyPI. All you need to change from the example code is to search-and-replace `concurrent.futures` with `futures`, and, for 2.x, `urllib.request` with `urllib2`.
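As a sketch of what that substitution yields (assuming the backport exposes a top-level `futures` module, as its older releases did), the docs example's imports become:

    import futures   # instead of: import concurrent.futures
    import urllib2   # instead of: import urllib.request

    # The rest of the example only changes its module prefix, e.g.
    # futures.ThreadPoolExecutor(max_workers=5) instead of
    # concurrent.futures.ThreadPoolExecutor(max_workers=5)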
Here's the sample backported to 2.x, modified to use your URL list and to add the times:
    import concurrent.futures
    import urllib2
    import time

    start = time.time()
    urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com", "http://www.amazon.com", "http://www.facebook.com"]

    # Retrieve a single page and report the url and contents
    def load_url(url, timeout):
        conn = urllib2.urlopen(url, timeout=timeout)
        return conn.read()

    # We can use a with statement to ensure threads are cleaned up promptly
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # Start the load operations and mark each future with its URL
        future_to_url = {executor.submit(load_url, url, 60): url for url in urls}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                data = future.result()
            except Exception as exc:
                print '%r generated an exception: %s' % (url, exc)
            else:
                print '"%s" fetched in %ss' % (url, (time.time() - start))

    print "Elapsed Time: %ss" % (time.time() - start)
But you can make this even simpler. Really, all you need is:
    def load_url(url):
        # timeout=60 matches the previous example; the original snippet
        # passed an undefined name here
        conn = urllib2.urlopen(url, timeout=60)
        data = conn.read()
        print '"%s" fetched in %ss' % (url, (time.time() - start))
        return data

    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        pages = executor.map(load_url, urls)

    print "Elapsed Time: %ss" % (time.time() - start)
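One caveat with the `executor.map` version: an exception from any fetch propagates when the results are iterated, so you lose the per-URL error handling the `as_completed` loop gave you. A hypothetical wrapper (the `safe_load_url` name is mine) restores it:

    # Swallow per-URL errors so one bad fetch doesn't abort the whole map.
    def safe_load_url(url):
        try:
            return load_url(url)
        except Exception as exc:
            print '%r generated an exception: %s' % (url, exc)
            return None

    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        pages = list(executor.map(safe_load_url, urls))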