Python follow redirects and then download the page?

Question:

I have the following Python script, and it works beautifully.

import urllib2

url = 'http://abc.com' # write the url here

usock = urllib2.urlopen(url)
data = usock.read()
usock.close()

print data

However, some of the URLs I give it may redirect two or more times. How can I have Python wait for the redirects to complete before reading the data?
For instance, when using the above code with
For instance when using the above code with

http://www.google.com/search?hl=en&q=KEYWORD&btnI=1

which is the equivalent of hitting the "I'm Feeling Lucky" button on a Google search, I get:

>>> url = 'http://www.google.com/search?hl=en&q=KEYWORD&btnI=1'
>>> usick = urllib2.urlopen(url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
    response = meth(req, response)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
    'http', request, response, code, msg, hdrs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
    return self._call_chain(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
>>> 

I've tried the (url, data, timeout) arguments to urlopen; however, I am unsure what to put there.
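As an aside, the 403 here is likely not about redirects at all: Google tends to reject requests that carry urllib's default User-Agent. Below is a sketch (in Python 3's urllib.request, the successor to urllib2) of sending a browser-like User-Agent — the value is just an illustrative placeholder, and the throwaway local server exists only to make the example self-contained by echoing back the header it received:

```python
import http.server
import threading
import urllib.request

# Throwaway local server that echoes back the User-Agent header it received.
class Echo(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(self.headers.get('User-Agent', '').encode())

    def log_message(self, *args):
        pass  # silence per-request logging

server = http.server.HTTPServer(('127.0.0.1', 0), Echo)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = 'http://127.0.0.1:%d/' % server.server_port
# Attach a custom User-Agent to the request (placeholder value).
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
with urllib.request.urlopen(req) as resp:
    seen = resp.read().decode()

print(seen)  # Mozilla/5.0
server.shutdown()
```

In Python 2 the same idea applies with urllib2.Request and urllib2.urlopen.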

EDIT:
I actually found out that if I don't follow the redirect and just read the headers of the first response, I can grab the Location of the next redirect and use that as my final link.
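That approach can be sketched with a handler that refuses to follow redirects, so the 3xx response surfaces as an HTTPError whose headers carry Location. This is a sketch in Python 3's urllib.request (urllib2's successor), demonstrated against a throwaway local server so it runs without external network access:

```python
import http.server
import threading
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so we can inspect Location ourselves."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # None makes the 3xx surface as an HTTPError

def first_redirect_target(url):
    """Return the Location header of the first redirect, or None if no redirect."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        opener.open(url)
    except urllib.error.HTTPError as e:
        if e.code in (301, 302, 303, 307, 308):
            return e.headers.get('Location')
        raise
    return None

# Throwaway local server that 302-redirects everything (hypothetical target URL).
class Redirector(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(302)
        self.send_header('Location', 'http://example.com/final')
        self.end_headers()

    def log_message(self, *args):
        pass  # silence per-request logging

server = http.server.HTTPServer(('127.0.0.1', 0), Redirector)
threading.Thread(target=server.serve_forever, daemon=True).start()

target = first_redirect_target('http://127.0.0.1:%d/' % server.server_port)
print(target)  # http://example.com/final
server.shutdown()
```

Note this only captures one hop; a chain of redirects would need the lookup applied repeatedly.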

Asked By: Cripto


Answers:

You might be better off with the Requests library, which has a better API for controlling redirect handling:

https://requests.readthedocs.io/en/master/user/quickstart/#redirection-and-history

Requests:

https://pypi.org/project/requests/ (urllib replacement for humans)
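For a concrete sense of that API: requests follows redirects by default, records the intermediate responses in r.history, and can be told to stop at the first response with allow_redirects=False. A sketch (assuming requests is installed), run against a throwaway local server so no external network is needed:

```python
import http.server
import threading

import requests

# Throwaway local server: /start 302-redirects to /final (hypothetical paths).
class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/start':
            self.send_response(302)
            self.send_header('Location', '/final')
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'landed')

    def log_message(self, *args):
        pass  # silence per-request logging

server = http.server.HTTPServer(('127.0.0.1', 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = 'http://127.0.0.1:%d' % server.server_port

# Redirects are followed by default; the hops are recorded in r.history.
r = requests.get(base + '/start')
print(r.url)                                # final URL after the redirect
print([h.status_code for h in r.history])   # [302]
print(r.text)                               # landed

# To stop at the first response instead, pass allow_redirects=False.
r2 = requests.get(base + '/start', allow_redirects=False)
print(r2.status_code, r2.headers['Location'])  # 302 /final
server.shutdown()
```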

Answered By: Mikko Ohtamaa

Use requests as the other answer states; here is an example. The final URL after redirects will be in r.url. In the examples below, http is redirected to https.

For HEAD:

In [1]: import requests
   ...: r = requests.head('http://github.com', allow_redirects=True)
   ...: r.url

Out[1]: 'https://github.com/'

For GET:

In [1]: import requests
   ...: r = requests.get('http://github.com')
   ...: r.url

Out[1]: 'https://github.com/'

Note that for HEAD you have to specify allow_redirects; if you don't, you can still find the target in the response headers, but this is not advised.

In [1]: import requests

In [2]: r = requests.head('http://github.com')

In [3]: r.headers.get('location')
Out[3]: 'https://github.com/'

To download the page you will need GET; you can then access the page body using r.content.

Answered By: Glen Thompson