Python follow redirects and then download the page?

Question

I have the following Python script, and it works beautifully.

import urllib2

url = 'http://abc.com' # write the url here

usock = urllib2.urlopen(url)
data = usock.read()
usock.close()

print data

However, some of the URLs I give it may redirect two or more times. How can I have Python follow the redirects to completion before loading the data?
For instance, when using the above code with

http://www.google.com/search?hl=en&q=KEYWORD&btnI=1

which is the equivalent of hitting the "I'm Feeling Lucky" button on a Google search, I get:

>>> url = 'http://www.google.com/search?hl=en&q=KEYWORD&btnI=1'
>>> usick = urllib2.urlopen(url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
    response = meth(req, response)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
    'http', request, response, code, msg, hdrs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
    return self._call_chain(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
>>>

I've tried passing the (url, data, timeout) arguments; however, I am unsure what to put there.

EDIT:
I actually found out that if I don't follow the redirect and just read the headers of the first response, I can grab the location of the next redirect and use that as my final link.

Asked By: Cripto

Answer 1

You might be better off with the Requests library, which has a better API for controlling redirect handling:

https://requests.readthedocs.io/en/master/user/quickstart/#redirection-and-history

Requests:

https://pypi.org/project/requests/ (urllib replacement for humans)

Answered By: Mikko Ohtamaa

Answer 2

Use requests, as the other answer states; here is an example. The final URL after any redirects will be in r.url. In the example below, http is redirected to https.

For HEAD:

In [1]: import requests
   ...: r = requests.head('http://github.com', allow_redirects=True)
   ...: r.url

Out[1]: 'https://github.com/'

For GET:

In [1]: import requests
   ...: r = requests.get('http://github.com')
   ...: r.url

Out[1]: 'https://github.com/'

Note that for HEAD you have to specify allow_redirects; if you don't, you can still read the redirect target from the headers, but this is not advised.

In [1]: import requests

In [2]: r = requests.head('http://github.com')

In [3]: r.headers.get('location')
Out[3]: 'https://github.com/'

To download the page you will need GET; you can then access the page body using r.content.
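
If adding a dependency is not an option, the same follow-then-download flow can be sketched with the Python 3 standard library, where urllib.request replaces urllib2 and follows redirects by default. This is only a minimal sketch: the function name download_after_redirects and the User-Agent value are made up for illustration. The custom User-Agent is an assumption, not something this thread confirms; some servers reject the default Python client string, which may be one cause of a 403 like the one in the question.

```python
import urllib.request


def download_after_redirects(url, timeout=10):
    """Return (final_url, body_bytes) after following any HTTP redirects.

    Hypothetical helper for illustration only.
    """
    # Assumption: a browser-like User-Agent; the exact value is arbitrary.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    # urlopen follows 301/302/303/307 redirects on its own before returning.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # geturl() reports the URL that was finally retrieved.
        return response.geturl(), response.read()


# Usage (needs network access):
#   final_url, body = download_after_redirects("http://github.com")
#   print(final_url, len(body))
```

The returned bytes play the same role as r.content in the requests examples above, and the returned URL plays the role of r.url.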

Answered By: Glen Thompson
