Access HDFS from outside Hadoop


There are a couple of typical ways:

  • If you are writing your program in Java, you can access HDFS files through the HDFS Java API. You are probably looking for FileSystem.open(), which returns an input stream that behaves like a generic open file.
  • If your program takes its input on stdin, you can stream the data with hadoop fs -cat: hadoop fs -cat /path/to/file/part-r-* | myprogram.pl. You could also bridge to this command from inside your own program with something like popen.
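
The popen-style bridge from the second option could look roughly like the Python sketch below. It assumes the hadoop command is on the PATH of the machine running the script; the helper names stream_command and hdfs_cat are made up for illustration and are not part of Hadoop.

```python
import subprocess


def stream_command(argv):
    """Run argv as a child process and yield its stdout line by line (as bytes)."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE)
    try:
        for line in proc.stdout:
            yield line
    finally:
        # Always release the pipe and reap the child, even if the caller stops early.
        proc.stdout.close()
        returncode = proc.wait()
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, argv)


def hdfs_cat(path):
    """Hypothetical helper: stream an HDFS file (or glob) through `hadoop fs -cat`."""
    # Assumption: the `hadoop` launcher is installed and configured for the cluster.
    return stream_command(["hadoop", "fs", "-cat", path])


# Example use (needs a working Hadoop client):
#     for line in hdfs_cat("/path/to/file/part-r-*"):
#         handle(line)
```

Because the arguments are passed as a list, no local shell is involved, so the part-r-* pattern reaches hadoop fs unchanged and is left for it to resolve against HDFS.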


Also check out WebHDFS, which made it into the 1.0.0 release and will also be in the 0.23.1 release. Since it is based on a REST API, any language can access it, and Hadoop does not need to be installed on the node where the HDFS files are required. It is also as fast as the other options mentioned by orangeoctopus.
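
As a rough illustration of the REST approach, here is a minimal Python sketch that reads a file through the WebHDFS OPEN operation using only the standard library. It assumes WebHDFS is enabled on the cluster; the host name, port, user and file path in the usage comment are placeholders, and the function names are invented for this example.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def webhdfs_open_url(host, port, path, user=None):
    """Build the WebHDFS URL for reading `path` (an absolute HDFS path)."""
    params = {"op": "OPEN"}
    if user:
        # Optional: identifies the caller on clusters without security enabled.
        params["user.name"] = user
    return "http://%s:%d/webhdfs/v1%s?%s" % (host, port, quote(path), urlencode(params))


def read_hdfs_file(host, port, path, user=None):
    """Fetch the whole file into memory and return its bytes."""
    # The NameNode answers OPEN with a redirect to a DataNode;
    # urlopen follows that redirect automatically.
    with urlopen(webhdfs_open_url(host, port, path, user)) as response:
        return response.read()


# Example use (placeholder values, adjust to your cluster):
#     data = read_hdfs_file("namenode.example.com", 50070, "/user/alice/data.txt", user="alice")
```

The port is whatever the NameNode's HTTP interface listens on; 50070 was the usual default in the 1.x line, but check your own configuration.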


The best way is to install the "hadoop-0.20-native" package on the box where you are running your code. The hadoop-0.20-native package can access the HDFS filesystem, so it can act as an HDFS proxy.