
Reading remote HDFS file with Java


Hadoop error messages are frustrating. Often they don't say what they mean and have nothing to do with the real issue. I've seen problems like this occur when the client, namenode, and datanode cannot communicate properly. In your case I would suspect one of two issues:

  • Your cluster runs in a VM and its virtualized network access to the client is blocked.
  • You are not consistently using fully-qualified domain names (FQDN) that resolve identically between the client and host.

The host name "test.server" is very suspicious. Check all of the following:

  • Is test.server a FQDN?
  • Is this the name that has been used EVERYWHERE in your conf files?
  • Can the client and all hosts forward and reverse resolve "test.server" and its IP address and get the same thing?
  • Are IP addresses being used instead of FQDN anywhere?
  • Is "localhost" being used anywhere?

Any inconsistency in the use of FQDN, hostname, numeric IP, and localhost must be removed. Do not ever mix them in your conf files or in your client code. Consistent use of FQDN is preferred. Consistent use of numeric IP usually also works. Use of unqualified hostnames, localhost, or 127.0.0.1 causes problems.
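If you want to verify resolution quickly, here is a minimal sketch (plain JDK, no Hadoop classes; "test.server" stands in for whatever name your conf files actually use). Run it on the client and on every cluster host and compare the output:

import java.net.InetAddress;

public class ResolveCheck {
    public static void main(String[] args) throws Exception {
        String host = "test.server";                      // the name used in your conf files
        InetAddress addr = InetAddress.getByName(host);   // forward lookup: name -> IP
        System.out.println(host + " -> " + addr.getHostAddress());

        String reverse = addr.getCanonicalHostName();     // reverse lookup: IP -> name
        System.out.println(addr.getHostAddress() + " -> " + reverse);
        // The forward and reverse results must agree everywhere, or HDFS clients will misbehave.
    }
}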


We need to make sure the configuration has fs.default.name set, for example:

configuration.set("fs.default.name","hdfs://ourHDFSNameNode:50000");
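(Side note: on Hadoop 2.x the same setting is usually written with the newer key fs.defaultFS; fs.default.name is deprecated but still honored, so either form should work:)

configuration.set("fs.defaultFS", "hdfs://ourHDFSNameNode:50000");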

Below I've put a piece of sample code:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration configuration = new Configuration();
configuration.set("fs.default.name", "hdfs://ourHDFSNameNode:50000");

// Path of the file to read; adjust to your own file.
Path pt = new Path("/path/to/file.txt");
FileSystem fs = pt.getFileSystem(configuration);

// try-with-resources closes the stream; readLine() returns null at end of file.
try (BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(pt)))) {
    String line;
    while ((line = br.readLine()) != null) {
        System.out.println(line);
    }
}
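As an aside, if you prefer not to set the property at all, you can pass the namenode address directly in the URI. A rough sketch (host, port, and path are placeholders):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Connect to the namenode named in the URI instead of relying on fs.default.name.
FileSystem fs = FileSystem.get(URI.create("hdfs://ourHDFSNameNode:50000/"), new Configuration());
Path pt = new Path("/path/to/file.txt");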


The answer above points in the right direction. Allow me to add the following:

  1. Namenode does NOT directly read or write data.
  2. The client (your Java program with direct access to HDFS) interacts with the Namenode to update the HDFS namespace and retrieve block locations for reading/writing.
  3. Client interacts directly with Datanode to read/write data.

You were able to list directory contents because hostname:9000 was accessible to your client code. You were doing number 2 above.
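To see number 2 in isolation, a small sketch like this (reusing the fs and pt variables from the sample code above, so the same assumptions apply) asks the Namenode for block locations and prints the Datanode hosts that an actual read would have to contact:

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;

// Metadata only: this talks to the Namenode, no Datanode is involved yet.
FileStatus status = fs.getFileStatus(pt);
BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
for (BlockLocation block : blocks) {
    // These are the Datanodes your client must be able to reach to read the data.
    System.out.println(String.join(", ", block.getHosts()));
}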
To be able to read and write, your client code needs access to the Datanode (number 3). The default port for Datanode DFS data transfer is 50010. Something was blocking your client's communication with hostname:50010, possibly a firewall or an SSH tunneling configuration problem.
I was using Hadoop 2.7.2, so maybe you have a different port number setting.
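If you want to rule a firewall in or out quickly, a plain socket test against the Datanode transfer port is enough; 50010 is the Hadoop 2.x default (governed by dfs.datanode.address), and "test.server" below is a placeholder for your Datanode host:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {
    public static void main(String[] args) {
        // Datanode data-transfer port; 50010 is the Hadoop 2.x default.
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress("test.server", 50010), 5000);
            System.out.println("Datanode port is reachable");
        } catch (IOException e) {
            System.out.println("Cannot reach Datanode port: " + e.getMessage());
        }
    }
}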