Python 3.4 urllib.request error (HTTP 403)

Question:

I’m trying to open and parse an HTML page. In Python 2.7.8 I have no problem:

import urllib
url = "https://ipdb.at/ip/66.196.116.112"
html = urllib.urlopen(url).read()

and everything is fine. However, I want to move to Python 3.4, and there I get HTTP Error 403 (Forbidden). My code:

import urllib.request
html = urllib.request.urlopen(url) # same URL as before

File "C:Python34liburllibrequest.py", line 153, in urlopen
return opener.open(url, data, timeout)
File "C:Python34liburllibrequest.py", line 461, in open
response = meth(req, response)
File "C:Python34liburllibrequest.py", line 574, in http_response
'http', request, response, code, msg, hdrs)
File "C:Python34liburllibrequest.py", line 499, in error
return self._call_chain(*args)
File "C:Python34liburllibrequest.py", line 433, in _call_chain
result = func(*args)
File "C:Python34liburllibrequest.py", line 582, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

It works for other URLs that don’t use HTTPS.

url = 'http://www.stopforumspam.com/ipcheck/212.91.188.166'

is OK.

Asked By: Belial


Answers:

It seems the site does not like the default user agent that Python 3.x’s urllib sends.

Specifying a User-Agent header will solve your problem:

import urllib.request
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()

NOTE: Python 2.x’s urllib also receives a 403 status, but unlike Python 2.x’s urllib2 and Python 3.x’s urllib, it does not raise an exception.

You can confirm that with the following code:

print(urllib.urlopen(url).getcode())  # => 403
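
In Python 3 you don’t have to let the HTTPError propagate; you can catch it and inspect the status and headers. A minimal sketch, using the URL from the question:

import urllib.request
import urllib.error

url = "https://ipdb.at/ip/66.196.116.112"
try:
    html = urllib.request.urlopen(url).read()
except urllib.error.HTTPError as e:
    print(e.code)     # 403
    print(e.headers)  # response headers sent along with the error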
Answered By: falsetru

Here are some notes I gathered on urllib while studying Python 3;
I kept them in case they come in handy or help someone else out.

How to import urllib.request and urllib.parse:

import urllib.request as urlRequest
import urllib.parse as urlParse

How to make a GET request:

url = "http://www.example.net"
# open the url
x = urlRequest.urlopen(url)
# get the source code
sourceCode = x.read()
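
Note that read() returns the response body as bytes; to work with it as text, decode it (UTF-8 here is an assumption; check the page’s declared charset):

# decode the bytes into a str (assuming the page is UTF-8)
text = sourceCode.decode("UTF-8")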

How to make a POST request:

url = "https://www.example.com"
values = {"q": "python if"}
# encode values for the url
values = urlParse.urlencode(values)
# encode the values in UTF-8 format
values = values.encode("UTF-8")
# create the request object
targetUrl = urlRequest.Request(url, values)
# open the url
x = urlRequest.urlopen(targetUrl)
# get the source code
sourceCode = x.read()

How to make a POST request (403 Forbidden responses):

url = "https://www.example.com"
values = {"q": "python urllib"}
# pretend to be a chrome 47 browser on a windows 10 machine
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}
# encode values for the url
values = urlParse.urlencode(values)
# encode the values in UTF-8 format
values = values.encode("UTF-8")
# create the request object
targetUrl = urlRequest.Request(url=url, data=values, headers=headers)
# open the url
x = urlRequest.urlopen(targetUrl)
# get the source code
sourceCode = x.read()

How to make a GET request (403 Forbidden responses):

url = "https://www.example.com"
# pretend to be a chrome 47 browser on a windows 10 machine
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}
req = urlRequest.Request(url, headers = headers)
# open the url
x = urlRequest.urlopen(req)
# get the source code
sourceCode = x.read()
Answered By: user5870134

The urllib.request HTTP 403 error occurs because a server security feature blocks known bot user agents.
Here are possible solutions, in order of feasibility (easiest to apply first):

Solution 1:

Add a different user agent, one that is simply NOT considered a bot.

from urllib.request import Request, urlopen 
web = "https://www.festo.com/de/de" 
headers = {
   "User-Agent": "XYZ/3.0",
   "X-Requested-With": "XMLHttpRequest"
} 
request = Request(web, headers=headers) 
content = urlopen(request).read()

Optionally, you can set a short timeout for the request if you’re running multiple requests consecutively:

content = urlopen(request, timeout=10).read()
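
If the server doesn’t respond within the timeout, urlopen raises an exception rather than hanging. A minimal sketch of handling that, assuming the request object from above (the exact exception type can vary by Python version, hence catching both):

import socket
from urllib.error import URLError

try:
    content = urlopen(request, timeout=10).read()
except (URLError, socket.timeout) as err:
    # a slow or unreachable server lands here instead of blocking forever
    print("request failed or timed out:", err)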

Solution 2:

Add a cookie from your browser: open the URL manually, accept all cookies, then copy the stored value into the request headers.

from urllib.request import Request, urlopen 
web = "https://www.festo.com/de/de" 
headers = {
   "User-Agent": "XYZ/3.0",
   "X-Requested-With": "XMLHttpRequest", 
   "cookie": "value stored in your webpage"
} 
request = Request(web, headers=headers) 
content = urlopen(request).read()

If you’re using Chrome, you can open the URL, open the inspector (press F12), choose the Application tab, and then pick Cookies under Storage in the tree on the left.
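
If you’d rather not copy cookie values by hand, urllib can also store and resend cookies automatically via http.cookiejar. A minimal sketch, assuming the same URL and headers as above; note this only captures cookies the server sets in Set-Cookie response headers, not ones created by in-browser scripts:

from urllib.request import Request, build_opener, HTTPCookieProcessor
from http.cookiejar import CookieJar

web = "https://www.festo.com/de/de"
headers = {
   "User-Agent": "XYZ/3.0",
   "X-Requested-With": "XMLHttpRequest"
}

jar = CookieJar()  # collects cookies from Set-Cookie response headers
opener = build_opener(HTTPCookieProcessor(jar))  # resends them on later requests
content = opener.open(Request(web, headers=headers)).read()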

Solution 3:

If cookies need to be handled for several websites, it is wise to make the request through a Session object, since it manages cookies automatically (note that this uses the third-party requests library rather than urllib).

import requests
web = "https://www.festo.com/de/de" 
headers = {
   "User-Agent": "XYZ/3.0",
   "X-Requested-With": "XMLHttpRequest"
} 
session = requests.Session()
content = session.get(web, headers=headers).text
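
Because the Session keeps cookies from earlier responses and resends them automatically, a follow-up request to the same site reuses whatever the first request set:

first = session.get(web, headers=headers)   # the server may set cookies here
second = session.get(web, headers=headers)  # those cookies are resent automatically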

Extra:

If SSL certificate verification fails while using urllib:

from urllib.request import Request, urlopen 
import ssl
web = "https://www.festo.com/de/de" 
headers = {
   "User-Agent": "XYZ/3.0",
   "X-Requested-With": "XMLHttpRequest"
} 
request = Request(web, headers=headers)

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE 

content = urlopen(request, context=ctx).read()
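
Keep in mind that ssl.CERT_NONE disables certificate validation entirely, so the connection is no longer protected against man-in-the-middle attacks; treat it as a debugging workaround, and prefer fixing the underlying certificate store (for example, with an up-to-date CA bundle such as the one shipped by the certifi package).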

Credits to the following: Question 1, Question 2, SSL-Certificate.

Answered By: MedoAlmasry