
Hadoop on EC2 vs Elastic Map Reduce


We use both approaches (EMR and EC2) at my job.

The advantages of EMR that Amar mentioned are more or less true, so if you want simplicity it may be the way to go.

But there are other considerations:

  • The version of Hadoop on EMR is far behind the Apache head: it is approximately 0.20.205, whereas head is at 2.x, which is essentially three versions up (1.0, 1.1, 2.0, ...):

hadoop@domU-12-31-39-07-B9-97:~$ ll hadoop*.jar
lrwxrwxrwx 1 hadoop hadoop 73 Feb 5 12:00 hadoop-examples-0.20.205.jar -> /home/hadoop/.versions/0.20.205/share/hadoop/hadoop-examples-0.20.205.jar
lrwxrwxrwx 1 hadoop hadoop 69 Feb 5 12:00 hadoop-test-0.20.205.jar -> /home/hadoop/.versions/0.20.205/share/hadoop/hadoop-test-0.20.205.jar
lrwxrwxrwx 1 hadoop hadoop 69 Feb 5 12:00 hadoop-core-0.20.205.jar -> /home/hadoop/.versions/0.20.205/share/hadoop/hadoop-core-0.20.205.jar
lrwxrwxrwx 1 hadoop hadoop 70 Feb 5 12:00 hadoop-tools-0.20.205.jar -> /home/hadoop/.versions/0.20.205/share/hadoop/hadoop-tools-0.20.205.jar
lrwxrwxrwx 1 hadoop hadoop 68 Feb 5 12:00 hadoop-ant-0.20.205.jar -> /home/hadoop/.versions/0.20.205/share/hadoop/hadoop-ant-0.20.205.jar

  • As a direct consequence, I had to re-code/restructure my MapReduce program because of contrib modules missing from the older version running on EMR.

  • You do not have as much opportunity to use non-MapReduce algorithms as you would with an up-to-date version of M/R.

  • Running your own cluster on EC2 gives you the flexibility to mix and match versions of the Hadoop ecosystem.


Well, administering/monitoring/maintaining a cluster isn't a small task in itself. With EMR you can get machines configured and up and running, with your custom bootstrap code, in no time. Apart from all that, EMR provides a lot of other tools/options/facilities too.

Here you don't have to worry about terminating a cluster after the jobs are done. You could certainly implement this yourself in an EC2+Hadoop setup, but EMR does it for you in a neat way.
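As a rough, present-day illustration only (the cluster name, release label, instance types and S3 paths below are placeholders, not anything from this answer), here is a boto3 sketch of launching a cluster that terminates itself once its steps finish, by leaving KeepJobFlowAliveWhenNoSteps set to False:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="auto-terminating-example",            # placeholder cluster name
    ReleaseLabel="emr-6.9.0",                   # placeholder release label
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,   # cluster shuts down after the last step
    },
    Steps=[{
        "Name": "example-step",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/job.py"],   # placeholder job
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster", response["JobFlowId"])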

You also have the facility to resize the cluster even while your jobs are running!
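For example, a small boto3 sketch of resizing a live cluster (the cluster id and target count are placeholders): it looks up the TASK instance group and asks EMR to scale it while jobs keep running.

import boto3

emr = boto3.client("emr", region_name="us-east-1")
cluster_id = "j-XXXXXXXXXXXXX"   # placeholder cluster id

# Find the TASK instance group of the running cluster.
groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
task_group = next(g for g in groups if g["InstanceGroupType"] == "TASK")

# Ask EMR to scale that group to 10 instances.
emr.modify_instance_groups(
    ClusterId=cluster_id,
    InstanceGroups=[{"InstanceGroupId": task_group["Id"], "InstanceCount": 10}],
)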

The Pig and Hive versions that ship with EMR also contain patches which make it easier to work with files in S3.

Even in this answer you may find that EMR has been given the upper hand.


I answered another related question, so it might help to add this here as well.

Someone mentioned in a comment (and it is in fact a common impression) that EMR adds some cost on top of the EC2 nodes (the underlying master/compute nodes for Spark) and provides just the cluster, which isn't the case.

But what Elastic MapReduce focuses on is the elastic and scalability part, i.e. providing scalability for your jobs, where scalability is not just the number of nodes in the cluster but different parameters such as:

  1. Dynamically resizing the cluster while jobs are running.

  2. Reduced and optimized spin-up time, efficient step resubmission, and options like automatic termination on step completion.

  3. Configuration, management and update time. As a small example, release labels automatically handle Spark/Hadoop/other-application versions, giving you an easy way to upgrade, which you would have to do manually on EC2.

  4. Ecosystem availability. The EMR ecosystem is growing. It may not matter when you start, but when your requirements grow (for example when you start to integrate other systems, say stream processing with Flink), it is much easier to simply select Flink, Pig, Hive and many more at launch time, should you need them in the future.

  5. There are libraries in the AWS SDKs, like boto3 in Python, that help you submit steps, poll for completion, etc., which are very helpful when you need to scale (see the sketch after this list). Also, EMR integrates with orchestration frameworks like Airflow, where you can sense the state, resubmit, and spin up a cluster within the pipeline with one command.

  6. Expanding on the previous point, EMR Notebooks, for example, give you a quick and interactive way to submit Spark jobs from a Jupyter notebook and see the results and job progress immediately, which can boost your productivity.

  7. This point is most important from my experience, Sometimes, scaling up the jobs with more nodes save you more money then long running jobs with low number of nodes. Because the adding node cost sometime cost you low than the normalized hours you will be spending with ec2 or small emr cluster. Just to share my experience, we had a job that used to run for 3 days, we satrted to run it with bigger EMR cluster that reduced it to 6-8 hours and it still was in the same cost and was infact a bit less.