
Get domain name from given url


If you want to parse a URL, use java.net.URI. java.net.URL has a bunch of problems -- its equals method does a DNS lookup, which means code using it can be vulnerable to denial-of-service attacks when used with untrusted inputs.

"Mr. Gosling -- why did you make url equals suck?" explains one such problem. Just get in the habit of using java.net.URI instead.
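As a rough illustration of the difference: URI.equals compares the parsed components as text and does no name resolution, so it is safe to call on untrusted input. The class and method names below are made up for this sketch.

```java
import java.net.URI;

public class UriEqualityDemo {
    // Hypothetical helper: compares two URL strings via java.net.URI.
    // URI.equals is purely syntactic, so no DNS lookup is performed.
    public static boolean sameUri(String a, String b) {
        return URI.create(a).equals(URI.create(b));
    }

    public static void main(String[] args) {
        // Scheme and host are compared without regard to case.
        System.out.println(sameUri("http://example.com/a", "http://EXAMPLE.com/a"));
        // The path is compared case-sensitively, character by character.
        System.out.println(sameUri("http://example.com/a", "http://example.com/b"));
    }
}
```

URI.create throws an unchecked IllegalArgumentException on malformed input, so wrap it if the strings come from users.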

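    import java.net.URI;
    import java.net.URISyntaxException;

    public static String getDomainName(String url) throws URISyntaxException {
        URI uri = new URI(url);
        // getHost() returns null when the URL has no host, e.g. a relative URL.
        String domain = uri.getHost();
        if (domain == null) {
            return null;
        }
        return domain.startsWith("www.") ? domain.substring(4) : domain;
    }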

should do what you want.


Though it seems to work fine, is there a better approach, or are there edge cases that could fail?

Your code as written fails for these valid URLs:

  • httpfoo/bar -- relative URL with a path component that starts with http.
  • HTTP://example.com/ -- protocol is case-insensitive.
  • //example.com/ -- protocol relative URL with a host
  • www/foo -- a relative URL with a path component that starts with www
  • wwwexample.com -- domain name that does not start with "www." (with the dot) but does start with "www".
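To see how java.net.URI treats the inputs listed above, here is a small sketch. HostEdgeCases and hostOf are hypothetical names, and returning null for inputs without a host (or with invalid syntax) is my own choice for the example, not something the question specified.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class HostEdgeCases {
    // Hypothetical helper: returns the host component as parsed by java.net.URI,
    // or null when the input has no host or is not a syntactically valid URI.
    public static String hostOf(String url) {
        try {
            return new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String[] inputs = {
            "httpfoo/bar",          // relative URL, no host
            "HTTP://example.com/",  // upper-case scheme
            "//example.com/",       // protocol-relative URL
            "www/foo",              // relative URL, no host
            "wwwexample.com"        // parsed as a relative path, no host
        };
        for (String input : inputs) {
            System.out.println(input + " -> " + hostOf(input));
        }
    }
}
```

Only the two inputs with a "//" authority part yield a host; the rest are relative references, so the caller has to decide what to do with null.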

Hierarchical URLs have a complex grammar. If you try to roll your own parser without carefully reading RFC 3986, you will probably get it wrong. Just use the one that's built into the core libraries.

If you really need to deal with messy inputs that java.net.URI rejects, see RFC 3986 Appendix B:

Appendix B. Parsing a URI Reference with a Regular Expression

As the "first-match-wins" algorithm is identical to the "greedy" disambiguation method used by POSIX regular expressions, it is natural and commonplace to use a regular expression for parsing the potential five components of a URI reference.

The following line is the regular expression for breaking-down a well-formed URI reference into its components.

  ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
   12            3  4          5       6  7        8 9

The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis).
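A minimal sketch of applying that expression in Java, assuming you only care about the authority (group 4). Rfc3986Regex and authorityOf are hypothetical names. Keep in mind that the authority still includes any userinfo and port, so it is not the same thing as URI.getHost().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Rfc3986Regex {
    // The Appendix B expression; the backslash is doubled for the Java string literal.
    private static final Pattern URI_REFERENCE = Pattern.compile(
        "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    // Hypothetical helper: returns the authority (group 4), or null when the
    // reference has no "//authority" part or the expression does not match.
    public static String authorityOf(String reference) {
        Matcher m = URI_REFERENCE.matcher(reference);
        return m.matches() ? m.group(4) : null;
    }

    public static void main(String[] args) {
        System.out.println(authorityOf("http://www.ics.uci.edu/pub/ietf/uri/#Related"));
        System.out.println(authorityOf("http://example.com:80/docs"));
        System.out.println(authorityOf("www/foo"));
    }
}
```

The expression accepts almost any string, which is the point for messy inputs, but it also means it validates nothing; you still have to check the pieces yourself.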


    import java.net.*;
    import java.io.*;

    public class ParseURL {
      public static void main(String[] args) throws Exception {
        URL aURL = new URL("http://example.com:80/docs/books/tutorial"
                           + "/index.html?name=networking#DOWNLOADING");

        System.out.println("protocol = " + aURL.getProtocol());   // http
        System.out.println("authority = " + aURL.getAuthority()); // example.com:80
        System.out.println("host = " + aURL.getHost());           // example.com
        System.out.println("port = " + aURL.getPort());           // 80
        System.out.println("path = " + aURL.getPath());           // /docs/books/tutorial/index.html
        System.out.println("query = " + aURL.getQuery());         // name=networking
        System.out.println("filename = " + aURL.getFile());       // /docs/books/tutorial/index.html?name=networking
        System.out.println("ref = " + aURL.getRef());             // DOWNLOADING
      }
    }



Here is a short and simple line using InternetDomainName.topPrivateDomain() in Guava:

    InternetDomainName.from(new URL(url).getHost()).topPrivateDomain().toString()

Given http://www.google.com/blah, that will give you google.com. Or, given http://www.google.co.mx, it will give you google.co.mx.

As Sa Qada commented in another answer on this post, this question has been asked earlier: Extract main domain name from a given url. The best answer to that question is from Satya, who suggests Guava's InternetDomainName.topPrivateDomain().

public boolean isTopPrivateDomain()

Indicates whether this domain name is composed of exactly one subdomain component followed by a public suffix. For example, returns true for google.com and foo.co.uk, but not for www.google.com or co.uk.

Warning: A true result from this method does not imply that the domain is at the highest level which is addressable as a host, as many public suffixes are also addressable hosts. For example, the domain bar.uk.com has a public suffix of uk.com, so it would return true from this method. But uk.com is itself an addressable host.

This method can be used to determine whether a domain is probably the highest level for which cookies may be set, though even that depends on individual browsers' implementations of cookie controls. See RFC 2109 for details.

Putting that together with URL.getHost(), which the original post already contains, gives you:

    import com.google.common.net.InternetDomainName;
    import java.net.URL;

    public class DomainNameMain {
      public static void main(final String... args) throws Exception {
        final String urlString = "http://www.google.com/blah";
        final URL url = new URL(urlString);
        final String host = url.getHost();
        final InternetDomainName name = InternetDomainName.from(host).topPrivateDomain();

        System.out.println(urlString);
        System.out.println(host);
        System.out.println(name);
      }
    }