Urllib and validation of server certificate
You could create a urllib2 opener that does the validation for you using a custom handler. The following code is an example that works with Python 2.7.3. It assumes you have downloaded http://curl.haxx.se/ca/cacert.pem to the same folder where the script is saved.
    #!/usr/bin/env python
    import urllib2
    import httplib
    import ssl
    import socket
    import os

    CERT_FILE = os.path.join(os.path.dirname(__file__), 'cacert.pem')


    class ValidHTTPSConnection(httplib.HTTPConnection):
        "This class allows communication via SSL."

        default_port = httplib.HTTPS_PORT

        def __init__(self, *args, **kwargs):
            httplib.HTTPConnection.__init__(self, *args, **kwargs)

        def connect(self):
            "Connect to a host on a given (SSL) port."
            sock = socket.create_connection((self.host, self.port),
                                            self.timeout, self.source_address)
            if self._tunnel_host:
                self.sock = sock
                self._tunnel()
            self.sock = ssl.wrap_socket(sock,
                                        ca_certs=CERT_FILE,
                                        cert_reqs=ssl.CERT_REQUIRED)


    class ValidHTTPSHandler(urllib2.HTTPSHandler):

        def https_open(self, req):
            return self.do_open(ValidHTTPSConnection, req)


    opener = urllib2.build_opener(ValidHTTPSHandler)


    def test_access(url):
        print "Accessing", url
        page = opener.open(url)
        print page.info()
        data = page.read()
        print "First 100 bytes:", data[0:100]
        print "Done accessing", url
        print ""

    # This should work
    test_access("https://www.google.com")

    # Accessing a page with a self-signed certificate should not work.
    # At the time of writing, the following page uses a self-signed certificate.
    test_access("https://tidia.ita.br/")
Running this script, you should see output something like this:
    Accessing https://www.google.com
    Date: Mon, 14 Jan 2013 14:19:03 GMT
    Expires: -1
    ...
    First 100 bytes: <!doctype html><html itemscope="itemscope" itemtype="http://schema.org/WebPage"><head><meta itemprop
    Done accessing https://www.google.com

    Accessing https://tidia.ita.br/
    Traceback (most recent call last):
      File "https_validation.py", line 54, in <module>
        test_access("https://tidia.ita.br/")
      File "https_validation.py", line 42, in test_access
        page = opener.open(url)
      ...
      File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1177, in do_open
        raise URLError(err)
    urllib2.URLError: <urlopen error [Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed>
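For completeness: on Python 2.7.9 and later, the custom handler above is no longer necessary, because the standard library can perform this validation itself. A minimal sketch:

```python
import ssl

# Since Python 2.7.9 (and 3.4), ssl.create_default_context() builds a
# context that loads the system CA store, requires a valid certificate
# chain, and checks that the certificate matches the hostname.
context = ssl.create_default_context()

# urllib2.urlopen accepts it via the `context` keyword argument:
#   page = urllib2.urlopen("https://www.google.com", context=context)
```

If you are stuck on an older 2.x interpreter, the handler-based approach above remains the way to go.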
If you have a trusted Certificate Authority (CA) file, you can use the ssl library (Python 2.6 and later) to validate the certificate. Here's some code:
    import os.path
    import ssl
    import sys
    import urlparse
    import urllib


    def get_ca_path():
        '''Download the Mozilla CA file cached by the cURL project.

        If you have a trusted CA file from your OS, return the path
        to that instead.
        '''
        cafile_local = 'cacert.pem'
        cafile_remote = 'http://curl.haxx.se/ca/cacert.pem'
        if not os.path.isfile(cafile_local):
            print >> sys.stderr, "Downloading %s from %s" % (
                cafile_local, cafile_remote)
            urllib.urlretrieve(cafile_remote, cafile_local)
        return cafile_local


    def check_ssl(hostname, port=443):
        '''Check that an SSL certificate is valid.'''
        print >> sys.stderr, "Validating SSL cert at %s:%d" % (
            hostname, port)
        cafile_local = get_ca_path()
        try:
            server_cert = ssl.get_server_certificate((hostname, port),
                                                     ca_certs=cafile_local)
        except ssl.SSLError:
            print >> sys.stderr, "SSL cert at %s:%d is invalid!" % (
                hostname, port)
            raise


    class CheckedSSLUrlOpener(urllib.FancyURLopener):
        '''A URL opener that checks that SSL certificates are valid.

        On SSL error, it will raise ssl.SSLError.
        '''
        def open(self, fullurl, data=None):
            urlbits = urlparse.urlparse(fullurl)
            if urlbits.scheme == 'https':
                # urlparse splits 'host:port' and converts the port to an
                # int for us, so we avoid mixing up strings and integers.
                hostname = urlbits.hostname
                if urlbits.port is None:
                    port = 443
                else:
                    port = urlbits.port
                check_ssl(hostname, port)
            return urllib.FancyURLopener.open(self, fullurl, data)


    # Plain usage - can probably do once per day
    check_ssl('www.facebook.com')

    # URL opener
    opener = CheckedSSLUrlOpener()
    opener.open('https://www.facebook.com/find-friends/browser/')

    # Make it the default
    urllib._urlopener = opener
    urllib.urlopen('https://www.facebook.com/find-friends/browser/')
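The host/port handling in CheckedSSLUrlOpener.open can be factored into a small helper by letting urlparse do all the splitting. A sketch (parse_https_host_port is a name I'm introducing here, not part of the code above):

```python
try:
    from urlparse import urlparse          # Python 2
except ImportError:
    from urllib.parse import urlparse      # Python 3


def parse_https_host_port(fullurl, default_port=443):
    """Return (hostname, port) for an https URL, defaulting to port 443."""
    bits = urlparse(fullurl)
    if bits.scheme != 'https':
        raise ValueError("not an https URL: %r" % fullurl)
    # bits.port is already an int (or None), so there is no risk of
    # passing a string port on to the socket layer.
    return bits.hostname, bits.port or default_port
```

For example, parse_https_host_port("https://example.com:8443/x") returns ("example.com", 8443), and a URL with no explicit port falls back to 443.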
Some dangers with this code:
- You have to trust the CA file from the cURL project (http://curl.haxx.se/ca/cacert.pem), which is a cached version of Mozilla's CA file. It is also served over plain HTTP, so there is a potential MITM attack. It's better to replace get_ca_path with one that returns your local CA file, which will vary from host to host.
- There is no attempt to see if the CA file has been updated. Eventually, root certs will expire or be deactivated, and new ones will be added. A good idea would be to use a cron job to delete the cached CA file, so that a new one is downloaded daily.
- It's probably overkill to check certificates every time. You could manually check once per run, or keep a list of 'known good' hosts over the course of the run. Or, be paranoid!
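As an alternative to a cron job, the script itself could check the cached file's age before trusting it. A sketch, assuming the same cacert.pem cache as get_ca_path above (ca_file_is_fresh is a hypothetical helper, not part of the code above):

```python
import os
import time

MAX_AGE_SECONDS = 24 * 60 * 60  # treat the cached bundle as stale after a day


def ca_file_is_fresh(path, max_age=MAX_AGE_SECONDS, now=None):
    """Return True if `path` exists and was modified within `max_age` seconds."""
    if not os.path.isfile(path):
        return False
    if now is None:
        now = time.time()
    return (now - os.path.getmtime(path)) < max_age

# get_ca_path could then call ca_file_is_fresh('cacert.pem') and
# re-download the bundle whenever it returns False.
```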
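The "known good hosts" idea from the last point could be a thin wrapper that remembers which host:port pairs have already been validated during the run. A sketch (check_ssl_once is a hypothetical name; the checker argument stands in for the check_ssl function above):

```python
# A per-run cache of hosts whose certificates have already been
# validated, so each host:port pair is only checked once.
_checked_hosts = set()


def check_ssl_once(hostname, port=443, checker=None):
    """Run `checker` (e.g. check_ssl) only on the first call per host:port.

    If the checker raises, the pair is not cached, so the next call
    will try to validate it again.
    """
    key = (hostname, port)
    if key in _checked_hosts:
        return
    if checker is not None:
        checker(hostname, port)
    _checked_hosts.add(key)
```

CheckedSSLUrlOpener.open could then call check_ssl_once instead of check_ssl to avoid re-validating the same host on every request.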