How to extract top-level domain name (TLD) from URL

python url parsing dns extract

Here's a great Python module someone wrote to solve this problem after seeing this question: https://github.com/john-kurkowski/tldextract

The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers.

Quote:

tldextract on the other hand knows what all gTLDs [Generic Top-Level Domains] and ccTLDs [Country Code Top-Level Domains] look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.

No, there is no "intrinsic" way of knowing that (e.g.) zap.co.it is a subdomain (because Italy's registrar DOES sell domains such as co.it) while zap.co.uk isn't (because the UK's registrar DOESN'T sell domains such as co.uk, but only like zap.co.uk).

You'll just have to use an auxiliary table (or online source) to tell you which TLDs behave peculiarly like the UK's and Australia's -- there's no way of divining that from just staring at the string without such extra semantic knowledge (of course it can change eventually, but if you can find a good online source that source will also change accordingly, one hopes!-).
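
To make the "auxiliary table" idea concrete, here is a minimal sketch. The table SAMPLE_SUFFIXES and the function split_host are hypothetical names invented for this illustration, and the table is a tiny hand-picked sample, not the real Public Suffix List. The idea is simply to find the longest known suffix and keep one extra label in front of it.

```python
# Toy lookup table: a hand-picked sample of suffixes for illustration only,
# NOT the real Public Suffix List (which has thousands of entries).
SAMPLE_SUFFIXES = {"com", "uk", "co.uk", "org.uk"}


def split_host(hostname, suffixes=SAMPLE_SUFFIXES):
    """Return (subdomain, registered_domain, suffix) for a bare hostname."""
    hostname = hostname.lower()
    if hostname in suffixes:
        raise ValueError(f"{hostname!r} is itself a suffix, not a domain")
    labels = hostname.split(".")
    # Longest candidate suffix first, so "co.uk" is tried before "uk".
    for i in range(1, len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in suffixes:
            registered = ".".join(labels[i - 1:])  # suffix plus one label
            subdomain = ".".join(labels[:i - 1])   # whatever is left in front
            return subdomain, registered, suffix
    raise ValueError(f"no known suffix in {hostname!r}")


print(split_host("www.zap.co.uk"))    # ('www', 'zap.co.uk', 'co.uk')
print(split_host("www.example.com"))  # ('www', 'example.com', 'com')
```

Without the "co.uk" entry in the table, the first call would report "co.uk" as the registered domain, which is exactly the mistake the table exists to prevent.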

Using this file of effective tlds which someone else found on Mozilla's website:

from urllib.parse import urlparse  # Python 3 home of the old urlparse module

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt", encoding="utf-8") as tld_file:
    tlds = [line.strip() for line in tld_file if line[0] not in "/\n"]

def get_domain(url, tlds):
    url_elements = urlparse(url)[1].split('.')
    # url_elements = ["abcde","co","uk"]

    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc

        candidate = ".".join(last_i_elements) # abcde.co.uk, co.uk, uk
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:]) # *.co.uk, *.uk, *
        exception_candidate = "!" + candidate

        # match tlds:
        if (exception_candidate in tlds):
            return ".".join(url_elements[i:])
        if (candidate in tlds or wildcard_candidate in tlds):
            return ".".join(url_elements[i-1:])
            # returns "abcde.co.uk"

    raise ValueError("Domain not in global list of TLDs")

print(get_domain("http://abcde.co.uk", tlds))

results in:

abcde.co.uk

I'd appreciate it if someone let me know which bits of the above could be rewritten in a more pythonic way. For example, there must be a better way of iterating over the last_i_elements list, but I couldn't think of one. I also don't know if ValueError is the best thing to raise. Comments?
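
One possible tidier take on the matching loop, offered only as a sketch: keep the rules in a set for fast membership tests, walk the labels with plain forward indices, and let urlsplit strip the port. The names SAMPLE_RULES and registered_domain are made up for this example, and the rule set is a tiny inline sample in the list's syntax ("*." for wildcard rules, "!" for exceptions) rather than the full file.

```python
from urllib.parse import urlsplit

# Tiny inline sample written in Public Suffix List syntax; an assumption made
# for illustration. The real file contains far more rules.
SAMPLE_RULES = frozenset({"com", "uk", "co.uk", "*.ck", "!www.ck"})


def registered_domain(url, rules=SAMPLE_RULES):
    """Return the registrable domain of url according to the given rules."""
    host = urlsplit(url).hostname or ""  # hostname is lowercased, port removed
    labels = host.split(".")
    for i in range(len(labels)):  # longest candidate first
        candidate = ".".join(labels[i:])
        wildcard = ".".join(["*"] + labels[i + 1:])
        if "!" + candidate in rules:
            # Exception rule: the candidate itself is the registrable domain.
            return candidate
        if candidate in rules or wildcard in rules:
            if i == 0:
                break  # the host is itself a public suffix
            return ".".join(labels[i - 1:])  # suffix plus one label
    raise ValueError(f"no registrable domain found in {url!r}")


print(registered_domain("http://abcde.co.uk"))                    # abcde.co.uk
print(registered_domain("https://shop.example.co.ck:8080/page"))  # example.co.ck
print(registered_domain("http://www.ck/"))                        # www.ck
```

ValueError seems a reasonable choice here, since the argument has the right type but a value the function cannot handle.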
