"SSL: certificate_verify_failed" error when scraping https://www.thenewboston.com/
Question:
So I started learning Python recently using “The New Boston’s” videos on YouTube. Everything was going great until I got to his tutorial on making a simple web crawler. While I understood it with no problem, when I run the code I get errors, all seemingly based around “SSL: CERTIFICATE_VERIFY_FAILED.” I’ve been searching for an answer since last night, trying to figure out how to fix it. It seems no one else in the comments on the video or on his website is having the same problem as me, and even using someone else’s code from his website I get the same results. I’ll post the code I got from the website, as it’s giving me the same error and the one I coded is a mess right now.
import requests
from bs4 import BeautifulSoup
def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://www.thenewboston.com/forum/category.php?id=15&orderby=recent&page=" + str(page) #this is page of popular posts
        source_code = requests.get(url)
        # just get the code, no headers or anything
        plain_text = source_code.text
        # BeautifulSoup objects can be sorted through easy
        for link in soup.findAll('a', {'class': 'index_singleListingTitles'}): #all links, which contains "" class='index_singleListingTitles' "" in it.
            href = "https://www.thenewboston.com/" + link.get('href')
            title = link.string # just the text, not the HTML
            print(href)
            print(title)
            # get_single_item_data(href)
        page += 1
trade_spider(1)
The full error is: ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645)
I apologize if this is a dumb question, I’m still new to programming but I seriously can’t figure this out, I was thinking about just skipping this tutorial but it’s bothering me not being able to fix this, thanks!
Answers:
I’m posting this as an answer because I’ve gotten past your issue thus far, but there’s still issues in your code (which when fixed, I can update).
So long story short: you could be using an old version of requests, or the SSL certificate could be invalid. There’s more information in this SO question: Python requests "certificate verify failed"
I’ve updated the code into my own bsoup.py file:
#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup
def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://www.thenewboston.com/forum/category.php?id=15&orderby=recent&page=" + str(page) #this is page of popular posts
        source_code = requests.get(url, timeout=5, verify=False)
        # just get the code, no headers or anything
        plain_text = source_code.text
        # BeautifulSoup objects can be sorted through easy
        for link in BeautifulSoup.findAll('a', {'class': 'index_singleListingTitles'}): #all links, which contains "" class='index_singleListingTitles' "" in it.
            href = "https://www.thenewboston.com/" + link.get('href')
            title = link.string # just the text, not the HTML
            print(href)
            print(title)
            # get_single_item_data(href)
        page += 1
if __name__ == "__main__":
    trade_spider(1)
When I run the script, it gives me this error:
https://www.thenewboston.com/forum/category.php?id=15&orderby=recent&page=1
Traceback (most recent call last):
  File "./bsoup.py", line 26, in <module>
    trade_spider(1)
  File "./bsoup.py", line 16, in trade_spider
    for link in BeautifulSoup.findAll('a', {'class': 'index_singleListingTitles'}): #all links, which contains "" class='index_singleListingTitles' "" in it.
  File "/usr/local/lib/python3.4/dist-packages/bs4/element.py", line 1256, in find_all
    generator = self.descendants
AttributeError: 'str' object has no attribute 'descendants'
There’s an issue with the findAll call: it is invoked on the BeautifulSoup class itself rather than on a parsed document, so the string 'a' ends up where the instance is expected. I’ve used both python3 and python2, wherein python2 reports this:
TypeError: unbound method find_all() must be called with BeautifulSoup instance as first argument (got str instance instead)
So it looks like you’ll need to fix up that call before you can continue.
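One way to get past that error is to parse the page text into a BeautifulSoup instance first and call find_all on that instance. This is only a minimal sketch: the helper name extract_links and the sample HTML are made up for illustration, and it assumes the forum markup still uses the index_singleListingTitles class.

```python
from bs4 import BeautifulSoup

BASE_URL = "https://www.thenewboston.com/"


def extract_links(html):
    """Return (href, title) pairs for every listing link in the page."""
    # Parse the HTML first; find_all must be called on this instance,
    # not on the BeautifulSoup class.
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for link in soup.find_all("a", {"class": "index_singleListingTitles"}):
        results.append((BASE_URL + link.get("href"), link.string))
    return results


# Hypothetical sample markup, standing in for source_code.text
sample = '<a class="index_singleListingTitles" href="forum/topic.php?id=1">First post</a>'
for href, title in extract_links(sample):
    print(href)
    print(title)
```

In the spider itself, plain_text would be passed in place of the sample string.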
You can tell requests not to verify the SSL certificate:
>>> url = "https://www.thenewboston.com/forum/category.php?id=15&orderby=recent&page=1"
>>> response = requests.get(url, verify=False)
>>> response.status_code
200
See more in the requests doc.
The problem is not in your code but in the web site you are trying to access. When looking at the analysis by SSLLabs you will note:
This server’s certificate chain is incomplete. Grade capped to B.
This means that the server configuration is wrong and that not only Python but several other clients will have problems with this site. Some desktop browsers work around this configuration problem by trying to load the missing certificates from the internet or by filling in with cached certificates. But other browsers or applications will fail too, similar to Python.
To work around the broken server configuration you might explicitly extract the missing certificates and add them to your trust store. Or you might give the certificate as trusted via the verify argument. From the documentation:
You can pass verify the path to a CA_BUNDLE file or directory with
certificates of trusted CAs:
>>> requests.get('https://github.com', verify='/path/to/certfile')
This list of trusted CAs can also be specified through the
REQUESTS_CA_BUNDLE environment variable.
You are probably missing the stock certificates in your system. E.g. if running on Ubuntu, check that the ca-certificates package is installed.
If you use the Python dmg installer, you also have to read Python 3’s ReadMe and run the bash command to get new certificates.
Try running:
/Applications/Python 3.6/Install Certificates.command
I spent several hours trying to fix some Python and update certs on a VM. In my case I was working against a server that someone else had set up. It turned out that the wrong cert had been uploaded to the server. I found this command on another SO answer.
root@ubuntu:~/cloud-tools# openssl s_client -connect abc.def.com:443
CONNECTED(00000005)
depth=0 OU = Domain Control Validated, CN = abc.def.com
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 OU = Domain Control Validated, CN = abc.def.com
verify error:num=21:unable to verify the first certificate
verify return:1
---
Certificate chain
0 s:OU = Domain Control Validated, CN = abc.def.com
i:C = US, ST = Arizona, L = Scottsdale, O = "GoDaddy.com, Inc.", OU = http://certs.godaddy.com/repository/, CN = Go Daddy Secure Certificate Authority - G2
It’s worth shedding a bit more "hands-on" light on what happens here, building on @Steffen Ullrich’s answer here and elsewhere:
- urllib and “SSL: CERTIFICATE_VERIFY_FAILED” Error
- Python Urllib2 SSL error (a very detailed answer)
Notes:
- I’ll use a different website from the OP’s, because the OP’s website currently has no issues.
- I used Ubuntu to run the following commands (curl and openssl). I tried running curl on my Windows 10, but got different, unhelpful output.
The error experienced by the OP can be "reproduced" by using the following curl command:
curl -vvI https://www.vimmi.net
Which outputs (note the last line):
* TCP_NODELAY set
* Connected to www.vimmi.net (82.80.192.7) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /etc/ssl/certs/ca-certificates.crt
CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (OUT), TLS alert, Server hello (2):
* SSL certificate problem: unable to get local issuer certificate
* stopped the pause stream!
* Closing connection 0
curl: (60) SSL certificate problem: unable to get local issuer certificate
Now let’s run it with the --insecure flag, which will display the problematic certificate:
curl --insecure -vvI https://www.vimmi.net
Outputs (note the last two lines):
* Rebuilt URL to: https://www.vimmi.net/
* Trying 82.80.192.7...
* TCP_NODELAY set
* Connected to www.vimmi.net (82.80.192.7) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /etc/ssl/certs/ca-certificates.crt
CApath: /etc/ssl/certs
* [...]
* Server certificate:
* subject: OU=Domain Control Validated; CN=vimmi.net
* start date: Aug 5 15:43:45 2019 GMT
* expire date: Oct 4 16:16:12 2020 GMT
* issuer: C=US; ST=Arizona; L=Scottsdale; O=GoDaddy.com, Inc.; OU=http://certs.godaddy.com/repository/; CN=Go Daddy Secure Certificate Authority - G2
* SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
The same result can be seen using openssl, which is worth mentioning because it’s used internally by Python:
echo | openssl s_client -connect vimmi.net:443
Outputs:
CONNECTED(00000005)
depth=0 OU = Domain Control Validated, CN = vimmi.net
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 OU = Domain Control Validated, CN = vimmi.net
verify error:num=21:unable to verify the first certificate
verify return:1
---
Certificate chain
0 s:OU = Domain Control Validated, CN = vimmi.net
i:C = US, ST = Arizona, L = Scottsdale, O = "GoDaddy.com, Inc.", OU = http://certs.godaddy.com/repository/, CN = Go Daddy Secure Certificate Authority - G2
---
Server certificate
-----BEGIN CERTIFICATE-----
[...]
-----END CERTIFICATE-----
[...]
---
DONE
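Since openssl is what Python uses under the hood, the same defaults can be inspected from Python’s standard ssl module. A small sketch follows; the output depends on the system, so none is shown, and it says nothing about any separate CA bundle a third-party library might bring along.

```python
import ssl

# The OpenSSL build that this Python interpreter is linked against.
print(ssl.OPENSSL_VERSION)

# Default CA file and CA directory that OpenSSL will search.
paths = ssl.get_default_verify_paths()
print("cafile:", paths.cafile)
print("capath:", paths.capath)

# A default client context requires a verifiable chain and checks the hostname.
context = ssl.create_default_context()
print(context.verify_mode == ssl.CERT_REQUIRED)
print(context.check_hostname)
```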
So why can’t curl and openssl verify the certificate Go Daddy issued for that website?
Well, to "verify a certificate" (to use openssl’s error message terminology) means to verify that the certificate contains a trusted source signature (put differently: the certificate was signed by a trusted source), thus verifying vimmi.net’s identity ("identity" here strictly means that "the public key contained in the certificate belongs to the person, organization, server or other entity noted in the certificate").
A source is "trusted" if we can establish its "chain of trust", with the following properties:
- The Issuer of each certificate (except the last one) matches the Subject of the next certificate in the list
- Each certificate (except the last one) is signed by the secret key corresponding to the next certificate in the chain (i.e. the signature of one certificate can be verified using the public key contained in the following certificate)
- The last certificate in the list is a trust anchor: a certificate that you trust because it was delivered to you by some trustworthy procedure
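The first and third properties can be sketched with a toy model. The Cert class and chain_is_linked function below are hypothetical names invented for illustration, and the sketch deliberately skips the second property (the cryptographic signature check), which a real verifier such as openssl performs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cert:
    subject: str
    issuer: str


def chain_is_linked(chain, trust_anchors):
    """Check the naming rules of a chain ordered from leaf to anchor."""
    if not chain:
        return False
    # First property: each issuer matches the subject of the next certificate.
    for current, following in zip(chain, chain[1:]):
        if current.issuer != following.subject:
            return False
    # Third property: the last certificate must be a known trust anchor.
    return chain[-1].subject in trust_anchors


leaf = Cert("vimmi.net", "Go Daddy Secure Certificate Authority - G2")
intermediate = Cert("Go Daddy Secure Certificate Authority - G2",
                    "Go Daddy Root Certificate Authority - G2")
root = Cert("Go Daddy Root Certificate Authority - G2",
            "Go Daddy Root Certificate Authority - G2")
anchors = {root.subject}

print(chain_is_linked([leaf, intermediate, root], anchors))  # True
print(chain_is_linked([leaf], anchors))  # False: only the leaf is available
```

The second call mirrors the failure above: with only the leaf certificate in hand, the chain never reaches a trust anchor.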
In our case, the issuer is "Go Daddy Secure Certificate Authority – G2". That is, the entity named "Go Daddy Secure Certificate Authority – G2" signed the certificate, so it’s supposed to be a trusted source.
To establish this entity’s trustworthiness, we have 2 options:
- Assume that "Go Daddy Secure Certificate Authority – G2" is a "trust anchor" (the third property above). Well, it turns out that curl and openssl try to act upon this assumption: they search for that entity’s certificate in their default paths (called CA paths), which are:
  - for curl, it’s /etc/ssl/certs.
  - for openssl, it’s /usr/lib/ssl (run openssl version -a to see that).
  But that certificate wasn’t found, leaving us with a second option:
- Follow the first and second properties listed above; in order to do that, we need to get the certificate issued for that entity. This can be achieved by downloading it from its source, or by using the browser.
  - for example, go to vimmi.net using Chrome, click the padlock > "Certificate" > "Certification Path" tab, select the entity > "View Certificate", then in the opened window go to the "Details" tab > "Copy to File" > Base-64 encoded > save the file.
Great! Now that we have that certificate (which can be in whatever file format: cer, pem, etc.; you can even save it as a txt file), let’s tell curl to use it:
curl --cacert test.cer https://vimmi.net
Going back to Python
Once we have:
- "Go Daddy Secure Certificate Authority – G2" certificate
- "Go Daddy Root Certificate Authority – G2" certificate (wasn’t mentioned above, but can be achieved in a similar way).
We need to copy their contents into a single file; let’s call it combined.cer and put it in the current directory. Then, simply:
import requests
res = requests.get("https://vimmi.net", verify="./combined.cer")
print(res.status_code) # 200
- BTW, "Go Daddy Root Certificate Authority – G2" is listed as a trusted authority by browsers and various tools; that’s why we didn’t have to specify it for curl.
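Copying the certificates into combined.cer can also be scripted. The following is a rough sketch under the assumption that both certificates were exported as Base-64 (PEM) text files; the function name combine_certificates and the input file names in the usage comment are hypothetical.

```python
from pathlib import Path


def combine_certificates(cert_paths, bundle_path):
    """Concatenate PEM (Base-64) certificate files into one bundle file."""
    parts = []
    for path in cert_paths:
        text = Path(path).read_text().strip()
        if "-----BEGIN CERTIFICATE-----" not in text:
            raise ValueError(f"{path} does not look like a PEM certificate")
        parts.append(text)
    # One certificate block after another, separated by newlines.
    Path(bundle_path).write_text("\n".join(parts) + "\n")
    return bundle_path


# Hypothetical file names for the two exported certificates:
# combine_certificates(["intermediate.cer", "root.cer"], "combined.cer")
```

The resulting file is what gets passed as verify="./combined.cer" in the requests call above.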
Further reading:
- how are ssl certificates verified, especially @ychaouche image.
- The First Few Milliseconds of an HTTPS Connection
- Wikipedia: Public key certificate, Certificate authority
- Nice video: Basics of Certificate Chain Validation.
- Helpful SE answers that focus on certificate signature terminology: 1, 2, 3.
- Certificates in relation to Man-In-The-Middle attack: 1, 2.
- The most dangerous code in the world: validating SSL certificates in non-browser software