How to extract top-level domain name (TLD) from URL


Here's a great Python module someone wrote to solve this problem after seeing this question: https://github.com/john-kurkowski/tldextract

The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers.

Quote:

tldextract on the other hand knows what all gTLDs [Generic Top-Level Domains] and ccTLDs [Country Code Top-Level Domains] look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.


No, there is no "intrinsic" way of knowing that (e.g.) zap.co.it is a subdomain (because Italy's registrar DOES sell domains such as co.it) while zap.co.uk isn't (because the UK's registrar DOESN'T sell domains such as co.uk, but only like zap.co.uk).

You'll just have to use an auxiliary table (or online source) to tell you which TLDs behave peculiarly like the UK's and Australia's. There's no way of divining that from just staring at the string without such extra semantic knowledge (of course it can change eventually, but if you can find a good online source, that source will also change accordingly, one hopes!-).
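To make the "auxiliary table" idea concrete, here is a minimal sketch. The table `KNOWN_SUFFIXES` and the function `split_host` are hypothetical names made up for illustration, and the table is hand-picked to mirror the co.uk / co.it claim above rather than taken from any real registry data. A real table would be far larger and would need to be kept up to date.

```python
# Hypothetical, hand-picked table for illustration only.
# It encodes the claim above: "co.uk" is a suffix, "co.it" is not.
KNOWN_SUFFIXES = {"uk", "co.uk", "it", "com"}


def split_host(hostname, suffixes=KNOWN_SUFFIXES):
    """Return (subdomain, registrable_domain) using the longest known suffix."""
    labels = hostname.lower().rstrip(".").split(".")
    # Try the longest candidate suffix first, always leaving at least
    # one label to the left of it to act as the registered name.
    for start in range(1, len(labels)):
        if ".".join(labels[start:]) in suffixes:
            registrable = ".".join(labels[start - 1:])
            subdomain = ".".join(labels[:start - 1])
            return subdomain, registrable
    raise ValueError("no known suffix in %r" % hostname)


print(split_host("zap.co.uk"))  # ('', 'zap.co.uk')
print(split_host("zap.co.it"))  # ('zap', 'co.it')
```

The two strings have the same shape, and only the table tells them apart, which is the point being made above.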


Using this file of effective TLDs which someone else found on Mozilla's website:

from urllib.parse import urlparse

# Load TLDs, ignoring comments and empty lines.
# A set makes the membership tests below fast.
with open("effective_tld_names.dat.txt", encoding="utf-8") as tld_file:
    tlds = set(line.strip() for line in tld_file if line[0] not in "/\n")

def get_domain(url, tlds):
    url_elements = urlparse(url)[1].split('.')
    # url_elements = ["abcde", "co", "uk"]

    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        #    i=-3: ["abcde", "co", "uk"]
        #    i=-2: ["co", "uk"]
        #    i=-1: ["uk"] etc

        candidate = ".".join(last_i_elements)  # abcde.co.uk, co.uk, uk
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:])  # *.co.uk, *.uk, *
        exception_candidate = "!" + candidate

        # Match against the TLD rules.
        if exception_candidate in tlds:
            return ".".join(url_elements[i:])
        if candidate in tlds or wildcard_candidate in tlds:
            # For the example URL this returns "abcde.co.uk".
            return ".".join(url_elements[i-1:])

    raise ValueError("Domain not in global list of TLDs")

print(get_domain("http://abcde.co.uk", tlds))

results in:

abcde.co.uk

I'd appreciate it if someone let me know which bits of the above could be rewritten in a more pythonic way. For example, there must be a better way of iterating over the last_i_elements list, but I couldn't think of one. I also don't know if ValueError is the best thing to raise. Comments?
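On the "more pythonic" question, one possible restructuring is sketched below. It is only a sketch under stated assumptions: `load_rules` and `registrable_domain` are hypothetical names, the rules are read from any iterable of lines in the same file format (so a short inline excerpt stands in for the real file here), and a host that is itself a suffix is treated as an error. It counts forward over the labels instead of using negative indices, keeps the rules in a set, and uses `urlparse(url).hostname` so that a port number does not end up in the result. `ValueError` is kept, since the argument's value is what is at fault.

```python
from urllib.parse import urlparse

# Illustrative excerpt in the same format as the effective TLD file.
SAMPLE_RULES = """
// comment lines and blank lines are skipped
uk
co.uk
com

*.ck
!www.ck
"""


def load_rules(lines):
    """Collect suffix rules, skipping blank lines and // comments."""
    stripped = (line.strip() for line in lines)
    return {line for line in stripped if line and not line.startswith("//")}


def registrable_domain(url, rules):
    """Return the suffix plus one label to its left, e.g. 'abcde.co.uk'."""
    hostname = urlparse(url).hostname  # lowercased, without any port
    if not hostname:
        raise ValueError("no hostname in %r" % url)
    labels = hostname.split(".")
    # Walk from the longest candidate suffix to the shortest.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        wildcard = ".".join(["*"] + labels[i + 1:])
        if "!" + candidate in rules:
            # Exception rule: the candidate itself is registrable.
            return candidate
        if candidate in rules or wildcard in rules:
            if i == 0:
                raise ValueError("%r is itself a suffix" % hostname)
            return ".".join(labels[i - 1:])
    raise ValueError("no matching suffix for %r" % hostname)


rules = load_rules(SAMPLE_RULES.splitlines())
print(registrable_domain("http://www.abcde.co.uk:8080/path", rules))  # abcde.co.uk
```

With the real file, `load_rules(open(path, encoding="utf-8"))` would work the same way, since an open file is also an iterable of lines.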