Read timeout using either urllib2 or any other http library
It's not possible for any library to do this without using some kind of asynchronous timer, through threads or otherwise. The reason is that the `timeout` parameter used in `httplib`, `urllib2`, and other libraries sets the timeout on the underlying `socket`. What that actually does is explained in the documentation:
> **SO_RCVTIMEO**
>
> Sets the timeout value that specifies the maximum amount of time an input function waits until it completes. It accepts a timeval structure with the number of seconds and microseconds specifying the limit on how long to wait for an input operation to complete. **If a receive operation has blocked for this much time without receiving additional data**, it shall return with a partial count or errno set to [EAGAIN] or [EWOULDBLOCK] if no data is received.
The bolded part is key. A `socket.timeout` is only raised if not a single byte has been received for the duration of the `timeout` window. In other words, this is a timeout *between* received bytes.
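To see that behaviour in isolation, here is a small sketch with no HTTP involved (it assumes a platform where `socket.socketpair()` is available): the sender trickles out one byte every half second, so the whole transfer takes about 2.5 seconds, yet a 1-second `settimeout()` never fires because every `recv()` sees data in time.

```python
import socket
import threading
import time

def trickle(conn):
    # send five bytes, one every 0.5 s; the whole transfer takes ~2.5 s
    for _ in range(5):
        conn.sendall(b'x')
        time.sleep(0.5)
    conn.close()

a, b = socket.socketpair()
b.settimeout(1.0)  # a per-recv timeout, not a total-transfer timeout

threading.Thread(target=trickle, args=(a,)).start()

received = b''
while len(received) < 5:
    # never raises socket.timeout: each recv() gets a byte within 1 s,
    # even though the complete read takes far longer than the timeout
    received += b.recv(1)

print(received == b'x' * 5)
```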
A simple function using `threading.Timer` could be as follows:
```python
import httplib
import socket
import threading

def download(host, path, timeout=10):
    content = None
    http = httplib.HTTPConnection(host)
    http.request('GET', path)
    response = http.getresponse()
    # once the timer fires, shut down the socket's read side so the
    # blocking read() below returns
    timer = threading.Timer(timeout, http.sock.shutdown, [socket.SHUT_RD])
    timer.start()
    try:
        content = response.read()
    except httplib.IncompleteRead:
        pass
    timer.cancel()  # cancel on triggered Timer is safe
    http.close()
    return content
```

```python
>>> host = 'releases.ubuntu.com'
>>> content = download(host, '/15.04/ubuntu-15.04-desktop-amd64.iso', 1)
>>> print content is None
True
>>> content = download(host, '/15.04/MD5SUMS', 1)
>>> print content is None
False
```
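The reason the shutdown trick unblocks the read is that a `recv()` blocked on a socket returns an empty result as soon as `SHUT_RD` takes effect. That mechanism can be sketched on its own with a local socket pair (assuming `socket.socketpair()` is available), no HTTP connection needed:

```python
import socket
import threading

a, b = socket.socketpair()

# after 0.5 s, force any recv() blocked on `b` to return
timer = threading.Timer(0.5, b.shutdown, [socket.SHUT_RD])
timer.start()

data = b.recv(1024)  # blocks until the timer shuts the read side down
timer.cancel()       # cancelling an already-fired Timer is harmless

print(data == b'')   # recv() returned empty, signalling end-of-stream
```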
Other than checking for `None`, it's also possible to catch the `httplib.IncompleteRead` exception outside the function instead of inside it. That won't work, though, if the HTTP response doesn't have a `Content-Length` header.
I found in my tests (using the technique described here) that a timeout set in the `urlopen()` call also affects the `read()` call:
```python
>>> import urllib2 as u
>>> c = u.urlopen('http://localhost/', timeout=5.0)
>>> s = c.read(1<<20)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
  File "/usr/lib/python2.7/httplib.py", line 561, in read
    s = self.fp.read(amt)
  File "/usr/lib/python2.7/httplib.py", line 1298, in read
    return s + self._file.read(amt - len(s))
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
socket.timeout: timed out
```
Maybe it's a feature of newer versions? I'm using Python 2.7 on a 12.04 Ubuntu straight out of the box.
One possible (imperfect) solution is to set the global socket timeout, explained in more detail here:
```python
import socket
import urllib2

# timeout in seconds
socket.setdefaulttimeout(10)

# this call to urllib2.urlopen now uses the default timeout
# we have set in the socket module
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
```
However, this only works if you're willing to globally modify the timeout for all users of the socket module. I'm running the request from within a Celery task, so doing this would mess up timeouts for the Celery worker code itself.
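If the global default is the only lever available, its scope can at least be narrowed by saving and restoring it around the call. This is only a sketch: the setting is still process-wide, so it remains racy if other threads (such as the Celery worker's own code) create sockets at the same time.

```python
import socket

old = socket.getdefaulttimeout()
socket.setdefaulttimeout(10)
try:
    s = socket.socket()        # sockets created here inherit the 10 s default
    inherited = s.gettimeout()
    s.close()
finally:
    socket.setdefaulttimeout(old)  # other socket users see the old value again

print(inherited)
```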
I'd be happy to hear any other solutions...