
Hadoop or Hadoop Streaming for MapReduce on AWS


You have a few options for running Hadoop on AWS. The simplest is to run your MapReduce jobs on Amazon's Elastic MapReduce (EMR) service: http://aws.amazon.com/elasticmapreduce. You could also run your own Hadoop cluster on EC2, as described at http://archive.cloudera.com/docs/ec2.html.

If you suspect you'll need to write your own input/output formats, partitioners, or combiners, I'd recommend using Java with the latter option (your own Hadoop cluster on EC2). If your job is relatively simple and you don't plan to use your Hadoop cluster for any other purpose, I'd recommend choosing the language you're most comfortable with and using EMR.
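To make that concrete, here is a minimal sketch of the kind of custom partitioner that has to be written in Java, assuming the newer org.apache.hadoop.mapreduce API; the class name FirstLetterPartitioner and the first-character routing rule are purely illustrative, not anything from the original posts:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative sketch only: route each key to a reducer bucket based on its
// first character, so keys starting with the same letter land in the same
// output partition. A hook like this has to be written in Java even if the
// rest of your job is driven through streaming.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Text.charAt() returns a non-negative Unicode code point for a valid
        // position, so the modulo below never yields a negative partition.
        int first = key.getLength() > 0 ? Character.toLowerCase(key.charAt(0)) : 0;
        return first % numPartitions;
    }
}
```

You would wire it in from your job driver with job.setPartitionerClass(FirstLetterPartitioner.class).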

Either way, good luck!

Disclosure: I am a founder of Cloudera.

Regards,
Jeff


I decided the flexibility of Java was more important than the drawbacks of porting my current code from C++ to Java.

Thanks for all your answers.


It depends on your needs. What is your input/output? Is it simple text files? Records with newline delimiters? Do you need a special combiner? A custom partitioner?

What I mean is that if you only need the Hadoop basics, then streaming will be fine. But if you need a little more complexity (from the Hadoop framework, not from your own business logic), hadoop jar will be more flexible.
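As a rough sketch of what I mean, here is the sort of driver you submit with hadoop jar, written against the org.apache.hadoop.mapreduce API; the class names, job name, and paths are made up for illustration, but setCombinerClass and setPartitionerClass are the framework hooks in question:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative word-count job showing the extra knobs a Java job exposes:
// a combiner for map-side pre-aggregation and, commented out, a custom
// partitioner such as the FirstLetterPartitioner sketched earlier.
public class WordCountWithCombiner {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tok = new StringTokenizer(line.toString());
            while (tok.hasMoreTokens()) {
                word.set(tok.nextToken());
                context.write(word, ONE);            // emit (word, 1) per token
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count with combiner");
        job.setJarByClass(WordCountWithCombiner.class);

        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);      // map-side pre-aggregation
        job.setReducerClass(SumReducer.class);
        // job.setPartitionerClass(FirstLetterPartitioner.class);  // custom key routing, if needed

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With streaming, by contrast, you supply the mapper and reducer as external commands on the command line; hooks like a custom partitioner or input format would still have to be Java classes.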

Sagie