Access HDFS from outside Hadoop
There are a couple of typical ways:
- You can access HDFS files through the HDFS Java API if you are writing your program in Java. You are probably looking for FileSystem.open, which gives you a stream that behaves like an ordinary open file.
- You can stream your data with hadoop fs -cat if your program takes input through stdin:
  hadoop fs -cat /path/to/file/part-r-* | myprogram.pl
  You could also build a bridge to this command from inside your own program with something like popen.
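The popen bridge mentioned above could look roughly like the following Python sketch. It assumes the hadoop command is on the PATH of the machine running the script. The function names stream_lines and hdfs_cat_command are made up for illustration and are not part of any Hadoop API.

```python
import subprocess


def hdfs_cat_command(path):
    """Build the argument list for `hadoop fs -cat <path>`."""
    # Assumption: the `hadoop` executable is on the PATH.
    return ["hadoop", "fs", "-cat", path]


def stream_lines(command):
    """Run `command` and yield its standard output one line at a time."""
    proc = subprocess.Popen(command, stdout=subprocess.PIPE, text=True)
    try:
        for line in proc.stdout:
            yield line.rstrip("\n")
    finally:
        # Always release the pipe and reap the child process.
        proc.stdout.close()
        returncode = proc.wait()
    # Reached only when the output was read to the end.
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, command)
```

Iterating over stream_lines(hdfs_cat_command("/path/to/file/part-r-*")) would then give the same lines that the pipe above feeds to myprogram.pl. No local shell is involved here, so the glob is passed to hadoop unchanged.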
Also check out WebHDFS, which made it into the 1.0.0 release and will be in the 0.23.1 release as well. Since it is based on a REST API, any language can access it, and Hadoop does not need to be installed on the node where the HDFS files are needed. It is also as fast as the other options mentioned by orangeoctopus.
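As a rough illustration of the REST approach, here is a minimal Python sketch that reads a file through WebHDFS using only the standard library. The host, port, and user values are placeholders you would need to supply. Port 50070 was the usual NameNode HTTP port in older releases, but treat that as an assumption and check your cluster's configuration. The function names are hypothetical.

```python
import urllib.parse
import urllib.request


def webhdfs_open_url(host, port, path, user=None):
    """Build the WebHDFS URL that reads the file at `path` (op=OPEN)."""
    query = {"op": "OPEN"}
    if user is not None:
        # user.name is only meaningful on clusters without Kerberos security.
        query["user.name"] = user
    quoted_path = urllib.parse.quote(path)
    return "http://{}:{}/webhdfs/v1{}?{}".format(
        host, port, quoted_path, urllib.parse.urlencode(query)
    )


def read_hdfs_file(host, port, path, user=None):
    """Return the contents of an HDFS file as bytes."""
    url = webhdfs_open_url(host, port, path, user)
    # The NameNode answers with a redirect to a DataNode;
    # urlopen follows that redirect for GET requests.
    with urllib.request.urlopen(url) as response:
        return response.read()
```

For example, webhdfs_open_url("namenode.example.com", 50070, "/user/alice/data.txt") returns http://namenode.example.com:50070/webhdfs/v1/user/alice/data.txt?op=OPEN. Because of the redirect, the client machine must be able to resolve and reach the DataNode hosts as well as the NameNode.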