Hadoop API VS. Hadoop Streaming


Usually we have a Map/Reduce pair written in Java: a map function that processes the input dataset in independent chunks, and a reduce function that combines the map results to perform some useful analysis. Hadoop Streaming is a utility that lets us write Map/Reduce applications in any language (Ruby, Python, Bash, etc.) that can read from STDIN (for input) and write to STDOUT (for output).
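To make the STDIN/STDOUT idea concrete, here is a rough word-count sketch in Python. The function names (map_lines, reduce_lines, run) are made up for illustration and are not part of Hadoop. It assumes the usual streaming convention of tab-separated key/value lines, and that the reducer receives its input already sorted by key.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts per word. Input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run(step, stream_in=sys.stdin, stream_out=sys.stdout):
    """Hypothetical helper: apply a step to STDIN and write records to STDOUT."""
    for record in step(stream_in):
        stream_out.write(record + "\n")
```

A mapper script would end with run(map_lines) and a reducer script with run(reduce_lines). Each script would then be handed to the streaming jar as the mapper or reducer command.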


You're right to say that if you don't use Java, you will not have all of the core Hadoop functionality available. Things like ChainMapper and ChainReducer, chained jobs, and the like are not available via streaming. Also, since Hadoop is written in Java, a Java job runs inside the task JVM and avoids the cost of piping every record to an external process, so it will generally be faster.

Also, another thing: in theory, no reducer starts until the mappers are done. What you might see in the HTML job page as reducers running at the same time as the mappers is just their input being moved around.


Hadoop Streaming enables us to write map and reduce functions in any programming or scripting language that supports reading data from standard input and writing to standard output: R, Python, C++, or pretty much any other language. This makes Hadoop Streaming very flexible and easy to use for a large number of users. Many parameters can be customized, for example the number of mappers, the number of reducers, JVM memory, input format, and output format. The default input format for Hadoop Streaming jobs is TextInputFormat, which reads the data one line at a time.
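Since everything moves around as lines of text, it helps to see how a line becomes a key/value pair. The sketch below mirrors the default streaming convention as I understand it: text up to the first tab is the key, the rest is the value, and a line with no tab is treated as all key with no value. The helper name split_record is hypothetical and not a Hadoop function.

```python
def split_record(line, separator="\t"):
    """Split one text line into (key, value) at the first separator.

    Assumed convention: a line without the separator is all key,
    and the value is None.
    """
    text = line.rstrip("\n")
    key, found, value = text.partition(separator)
    return (key, value if found else None)
```

For example, split_record("hadoop\t3\n") gives ("hadoop", "3"). Only the first tab counts, so any later tabs stay inside the value.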

The Hadoop API pretty much binds you to Java, but configuration and development are more straightforward since everything can be configured from the Java code itself. In my experience Java seems to be slightly faster, but streaming can get pretty close when properly configured and used with the right language.