Is there any loss of functionality with streaming jobs in hbase/hadoop versus using java? Is there any loss of functionality with streaming jobs in hbase/hadoop versus using java? hadoop hadoop

Is there any loss of functionality with streaming jobs in hbase/hadoop versus using java?


As far as I know, you are talking about 2(or more) totally different concepts.

"Hadoop Streaming" is there to stream data through your executable (independent from your choice of programming language). When using streaming there can't be any loss of functionality, since the functionality is basicly map/reduce the data you are getting from hadoop stream.

For hadoop part you can even use pig or hive big data query languages to get things done efficiently. With the newest versions of pig you can even write custom functions in python and use them inside your pig scripts.

Although there are tools to make you use the language you are comfortable with never forget that hadoop framework is mostly written in java. There could be times when you would need to write a specialized InputFormat; or a UDF inside pig, etc. Then a decent knowledge in java would come handy.

Your "Hbase coprocessors" example is kinda unrelated with streaming functionality of hadoop. Hbase coproccessors consists of 2 parts : server-side part, client-side part. I am pretty sure there would be some useful server-side coprocessors embedded inside hbase with release; but other than that you would need to write your own coprocessor (and bad news: its java). For client side I am sure you would be able to use them with your favorite programming language through thrift without too much problem.

So as an answer to your question: you can always dodge learning java; still using hadoop to it's potential (using 3rd party libraries/applications). But when shit hits the fan its better to understand the underlaying content; to be able to develop with java. Knowing java would give you a full control over hadoop/hbase enviroment.

Hope you would find this helpful.


Yes, you should get data local code execution with streaming. You do not push the data to where the program is, you push the program to where the data is. Streaming simply takes the local input data and runs it through stdin to your python program. Instead of each map running inside of a java task, it spins up and instance of your python program and just pumps the input through that.

If you really want to do fast processing you really should learn java though. Having to pipe everything through stdin and stout is a lot of overhead.