How to handle urllib's timeout in Python 3?

Question:

First off, my problem is quite similar to this one. I would like a timeout in urllib.request.urlopen() to raise an exception that I can handle.

Doesn’t this fall under URLError?

try:
    response = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
except (HTTPError, URLError) as error:
    logging.error(
        'Data of %s not retrieved because %s\nURL: %s', name, error, url)
else:
    logging.info('Access successful.')

The error message:

    resp = urllib.request.urlopen(req, timeout=10).read().decode('utf-8')
  File "/usr/lib/python3.2/urllib/request.py", line 138, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.2/urllib/request.py", line 369, in open
    response = self._open(req, data)
  File "/usr/lib/python3.2/urllib/request.py", line 387, in _open
    '_open', req)
  File "/usr/lib/python3.2/urllib/request.py", line 347, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.2/urllib/request.py", line 1156, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.2/urllib/request.py", line 1141, in do_open
    r = h.getresponse()
  File "/usr/lib/python3.2/http/client.py", line 1046, in getresponse
    response.begin()
  File "/usr/lib/python3.2/http/client.py", line 346, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.2/http/client.py", line 308, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.2/socket.py", line 276, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

There was a major change in Python 3 when they reorganised the urllib and urllib2 modules into urllib. Is it possible that a change made then causes this?

Asked By: nindalf


Answers:

Catch the different exceptions with explicit clauses, and check the reason for the exception with URLError (thank you Régis B.)

import logging
import urllib.request
from socket import timeout
from urllib.error import HTTPError, URLError

try:
    response = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
except HTTPError as error:
    # The server answered, but with an error status (4xx/5xx).
    logging.error('HTTP Error: Data of %s not retrieved because %s\nURL: %s', name, error, url)
except URLError as error:
    # URLError wraps lower-level failures; the underlying cause is in error.reason.
    if isinstance(error.reason, timeout):
        logging.error('Timeout Error: Data of %s not retrieved because %s\nURL: %s', name, error, url)
    else:
        logging.error('URL Error: Data of %s not retrieved because %s\nURL: %s', name, error, url)
else:
    logging.info('Access successful.')
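
The order of the except clauses matters, because HTTPError is a subclass of URLError: if the URLError clause came first, it would also swallow HTTP status errors. A small sketch to illustrate this, with no network access; caught_by is a made-up helper name and the example.invalid URL is a placeholder:

```python
import io
from email.message import Message
from urllib.error import HTTPError, URLError


def caught_by(exc):
    """Raise `exc` and report which except clause catches it (hypothetical helper)."""
    try:
        raise exc
    except HTTPError:
        # Most specific clause first.
        return "HTTPError"
    except URLError:
        return "URLError"


# Build an HTTPError by hand instead of contacting a server.
not_found = HTTPError("http://example.invalid/", 404, "Not Found", Message(), io.BytesIO())

print(issubclass(HTTPError, URLError))            # True
print(caught_by(not_found))                       # HTTPError
print(caught_by(URLError("connection refused")))  # URLError
```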

NB: In response to recent comments, the original post referred to Python 3.2, where you needed to catch timeout errors explicitly with socket.timeout. For example:



    # Warning - python 3.2 code
    from socket import timeout
    
    try:
        response = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
    except timeout:
        logging.error('socket timed out - URL %s', url)

Answered By: danodonovan

The previous answer does not correctly intercept timeout errors. Timeout errors are raised as URLError, so if we want to specifically catch them, we need to write:

import logging
import socket
import urllib.request
from urllib.error import HTTPError, URLError

try:
    response = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
except HTTPError as error:
    logging.error('Data not retrieved because %s\nURL: %s', error, url)
except URLError as error:
    # The wrapped low-level exception is available as error.reason.
    if isinstance(error.reason, socket.timeout):
        logging.error('socket timed out - URL %s', url)
    else:
        logging.error('some other error happened')
else:
    logging.info('Access successful.')

Note that ValueError can independently be raised, e.g. if the URL is invalid. Like HTTPError, it is not associated with a timeout.
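
To illustrate the ValueError case, here is a minimal sketch showing that a malformed URL is rejected before any network traffic happens. The helper name check_url is made up for this example; it builds a urllib.request.Request, which is the same object urlopen() constructs internally when it is given a string.

```python
import urllib.request


def check_url(url):
    """Return None if `url` parses, else the ValueError message (hypothetical helper)."""
    try:
        # Building a Request parses the URL but performs no network I/O.
        urllib.request.Request(url)
    except ValueError as error:
        return str(error)
    return None


print(check_url("example.com/page"))     # missing scheme: prints the error message
print(check_url("http://example.com/"))  # well-formed: prints None
```

Because ValueError is not a subclass of URLError, neither of the except clauses above would catch it.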

Answered By: Régis B.

What is a "timeout"? Holistically, I think it means "a situation where the server didn’t respond in time, typically because of high load, and it’s worth retrying."

HTTP status 504 "gateway timeout" would be a timeout under this definition. It’s delivered via HTTPError.

HTTP status 429 "too many requests" would also be a timeout under that definition. It too is delivered via HTTPError.
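
Since both statuses arrive as an HTTPError, the status code and the response headers are available on the exception. A minimal sketch, assuming you want to honour a Retry-After header when the server sends one (RFC 6585 allows it on 429 responses); retry_delay, RETRYABLE_STATUSES and the 5-second default are illustrative choices, not part of urllib:

```python
import io
from email.message import Message
from urllib.error import HTTPError

RETRYABLE_STATUSES = {429, 504}  # too many requests, gateway timeout


def retry_delay(error, default=5.0):
    """Return seconds to wait before retrying, or None if `error` is not retryable."""
    if error.code not in RETRYABLE_STATUSES:
        return None
    header = error.headers.get("Retry-After") if error.headers else None
    try:
        # Retry-After may also be an HTTP date; this sketch only handles seconds.
        return float(header)
    except (TypeError, ValueError):
        return default


# Build an HTTPError by hand so the example needs no network access.
headers = Message()
headers["Retry-After"] = "12"
error = HTTPError("http://example.invalid/", 429, "Too Many Requests", headers, io.BytesIO())
print(retry_delay(error))  # 12.0
```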

Otherwise, what do we mean by a timeout? Do we include timeouts in resolving the domain name via the DNS resolver? Timeouts when trying to send data? Timeouts when waiting for the data to come back?

I don’t know how to audit the urllib source code to be sure that every failure I would consider a timeout is raised in a form I’d catch; in a language without checked exceptions, I don’t see how to verify it. I have a hunch that connect-to-DNS errors might come back as socket.timeout, and connect-to-remote-server errors as URLError(socket.timeout). It’s just a guess that might explain the earlier observations.

So I fell back to some really defensive coding. (1) I’m handling some HTTP status codes that are indicative of timeouts. (2) There are reports that some timeouts come via socket.timeout exceptions, and some via URLError(socket.timeout) exceptions, so I’m catching both. (3) And just in case, I threw in HTTPError(socket.timeout) as well.

import socket
import sys
import time
import urllib.error
import urllib.parse
import urllib.request
from typing import Optional


# Wrapper added so that the `return` below is valid; the name and signature are illustrative.
def fetch_with_retry(url: str, cache: str) -> bytes:
    while True:
        reason: Optional[str] = None
        try:
            # Pass a timeout: without one, urlopen blocks indefinitely unless a global default is set.
            with urllib.request.urlopen(url, timeout=10) as response:
                content = response.read()
                with open(cache, "wb") as file:
                    file.write(content)
                return content
        except urllib.error.HTTPError as e:
            if e.code in (429, 504):  # 429=too many requests, 504=gateway timeout
                reason = f'{e.code} {e.reason}'
            elif isinstance(e.reason, socket.timeout):
                reason = f'HTTPError socket.timeout {e.reason} - {e}'
            else:
                raise
        except urllib.error.URLError as e:
            if isinstance(e.reason, socket.timeout):
                reason = f'URLError socket.timeout {e.reason} - {e}'
            else:
                raise
        except socket.timeout as e:
            reason = f'socket.timeout {e}'
        # Any other exception propagates to the caller unchanged.
        netloc = urllib.parse.urlsplit(url).netloc  # e.g. nominatim.openstreetmap.org
        print(f'*** {netloc} {reason}; will retry', file=sys.stderr)
        time.sleep(5)
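
The timeout shapes handled above can also be collected into one predicate, which keeps the retry loop short. This is only a sketch using the standard library; is_timeout is a hypothetical helper name. Since Python 3.10, socket.timeout is an alias of the built-in TimeoutError, so the same check covers both spellings.

```python
import socket
import urllib.error


def is_timeout(exc):
    """Return True if `exc` looks like a timeout in any of the shapes discussed."""
    if isinstance(exc, urllib.error.HTTPError):
        # The server answered, but asked us to back off or a gateway gave up.
        return exc.code in (429, 504) or isinstance(exc.reason, socket.timeout)
    if isinstance(exc, urllib.error.URLError):
        # Lower-level failures are wrapped; the cause is in .reason.
        return isinstance(exc.reason, socket.timeout)
    # A timeout that escaped unwrapped, e.g. while reading the response.
    return isinstance(exc, socket.timeout)


print(is_timeout(urllib.error.URLError(socket.timeout("timed out"))))  # True
print(is_timeout(urllib.error.URLError("name resolution failed")))     # False
print(is_timeout(socket.timeout("timed out")))                         # True
```

With such a helper, the loop could catch (urllib.error.URLError, socket.timeout) in a single clause, retry when is_timeout(e) is true, and re-raise otherwise.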
Answered By: Lucian Wischik