
Read timeout using either urllib2 or any other http library


It's not possible for any library to enforce a total read timeout without using some kind of asynchronous timer, through threads or otherwise. The reason is that the timeout parameter used in httplib, urllib2 and other libraries sets the timeout on the underlying socket. What that actually does is explained in the POSIX documentation for the SO_RCVTIMEO socket option:

SO_RCVTIMEO

Sets the timeout value that specifies the maximum amount of time an input function waits until it completes. It accepts a timeval structure with the number of seconds and microseconds specifying the limit on how long to wait for an input operation to complete. If a receive operation has blocked for this much time without receiving additional data, it shall return with a partial count or errno set to [EAGAIN] or [EWOULDBLOCK] if no data is received.

The key phrase is "without receiving additional data". A socket.timeout is only raised if not a single byte has been received for the duration of the timeout window. In other words, this is a timeout between received bytes, not a limit on the total read.
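This between-bytes behaviour is easy to demonstrate without any HTTP at all. The sketch below (Python 3, using a local socketpair rather than a real server) drips bytes across a connection slower than the total timeout but faster than the per-byte timeout, and the recv() never times out:

```python
# Minimal sketch of "timeout between bytes": a 1-second socket timeout never
# fires as long as each byte arrives within a second, even though the whole
# transfer takes longer than the timeout.
import socket
import threading
import time

a, b = socket.socketpair()
a.settimeout(1.0)

def drip():
    # send 5 bytes, one every 0.3 s: 1.5 s total, well over the 1 s timeout
    for ch in b'hello':
        time.sleep(0.3)
        b.sendall(bytes([ch]))
    b.close()

threading.Thread(target=drip).start()

received = b''
while len(received) < 5:
    received += a.recv(1024)  # each recv waits ~0.3 s, so no socket.timeout
a.close()
print(received)
```

The total transfer takes 1.5 seconds against a 1-second timeout, yet no socket.timeout is raised, because no single gap between bytes exceeds the timeout.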

A simple function using threading.Timer could be as follows.

import httplib
import socket
import threading

def download(host, path, timeout = 10):
    content = None

    http = httplib.HTTPConnection(host)
    http.request('GET', path)
    response = http.getresponse()

    timer = threading.Timer(timeout, http.sock.shutdown, [socket.SHUT_RD])
    timer.start()

    try:
        content = response.read()
    except httplib.IncompleteRead:
        pass

    timer.cancel() # cancel on triggered Timer is safe
    http.close()

    return content

>>> host = 'releases.ubuntu.com'
>>> content = download(host, '/15.04/ubuntu-15.04-desktop-amd64.iso', 1)
>>> print content is None
True
>>> content = download(host, '/15.04/MD5SUMS', 1)
>>> print content is None
False

Other than checking for None, it's also possible to catch the httplib.IncompleteRead exception outside the function rather than inside it. That approach won't work, though, if the HTTP response doesn't have a Content-Length header.


I found in my tests (using the technique described here) that a timeout set in the urlopen() call also affects the read() call:

import urllib2 as u
c = u.urlopen('http://localhost/', timeout=5.0)
s = c.read(1<<20)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
  File "/usr/lib/python2.7/httplib.py", line 561, in read
    s = self.fp.read(amt)
  File "/usr/lib/python2.7/httplib.py", line 1298, in read
    return s + self._file.read(amt - len(s))
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
socket.timeout: timed out

Maybe it's a feature of newer versions? I'm using Python 2.7 on a 12.04 Ubuntu straight out of the box.
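This isn't version-specific: the timeout argument is applied to the underlying socket, so every recv() during read() inherits it. The Python 3 sketch below (urllib2 became urllib.request) reproduces the behaviour against a local server that sends a partial body and then stalls; the handler and port are made up for the demo. Note that this is still only a between-bytes timeout, per the first answer:

```python
# Python 3 sketch: the timeout passed to urlopen() also applies to each
# later read(), because it is set on the underlying socket.
import http.server
import socket
import threading
import time
import urllib.request

class SlowHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Length', '10')
        self.end_headers()
        self.wfile.write(b'1234')  # partial body, then stall
        self.wfile.flush()
        time.sleep(3)

    def log_message(self, *args):
        pass

server = http.server.ThreadingHTTPServer(('127.0.0.1', 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = 'http://127.0.0.1:%d/' % server.server_address[1]
resp = urllib.request.urlopen(url, timeout=1.0)
try:
    resp.read()        # blocks waiting for the remaining 6 bytes
    timed_out = False
except socket.timeout:
    timed_out = True   # the urlopen() timeout fired inside read()
server.shutdown()
print(timed_out)
```

The headers arrive immediately, so urlopen() itself succeeds; it is the later read() that hits the one-second timeout while waiting for bytes that never come.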


One possible (imperfect) solution is to set the global socket timeout, explained in more detail here:

import socket
import urllib2

# timeout in seconds
socket.setdefaulttimeout(10)

# this call to urllib2.urlopen now uses the default timeout
# we have set in the socket module
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)

However, this only works if you're willing to globally modify the timeout for all users of the socket module. I'm running the request from within a Celery task, so doing this would mess up timeouts for the Celery worker code itself.
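The global scope of setdefaulttimeout() is exactly the problem: every socket created afterwards in the process picks it up. One mitigation, sketched below in Python 3 (the socket API is the same in Python 2), is to rely on per-socket settimeout() where you can, since it overrides the global default without touching it:

```python
# Sketch: setdefaulttimeout() is process-global, which is why it can leak
# into unrelated code (like a Celery worker); settimeout() is per-socket.
import socket

socket.setdefaulttimeout(10)

s1 = socket.socket()   # picks up the global default
s2 = socket.socket()
s2.settimeout(2)       # per-socket override; the global default is untouched

t1 = s1.gettimeout()
t2 = s2.gettimeout()
print(t1)  # 10.0
print(t2)  # 2.0

socket.setdefaulttimeout(None)  # restore, so other code is unaffected
s1.close()
s2.close()
```

This doesn't help directly with urllib2, which creates its sockets internally, but it shows why passing timeout= to urlopen() (which does a per-socket settimeout() under the hood) is safer than changing the module-wide default inside a Celery task.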

I'd be happy to hear any other solutions...