
Hadoop on EC2 vs Elastic Map Reduce


We use both approaches (EMR and EC2) at my job.

The advantages of EMR that Amar mentioned are more or less true, so if you want simplicity it may be the way to go.

But there are other considerations:

  • The version of Hadoop on EMR is far behind the Apache head: it is approximately 0.20.205, whereas head is at 2.x, which is essentially three versions up (1.0, 1.1, 2.0, ...):

hadoop@domU-12-31-39-07-B9-97:~$ ll hadoop*.jar
lrwxrwxrwx 1 hadoop hadoop 73 Feb 5 12:00 hadoop-examples-0.20.205.jar -> /home/hadoop/.versions/0.20.205/share/hadoop/hadoop-examples-0.20.205.jar
lrwxrwxrwx 1 hadoop hadoop 69 Feb 5 12:00 hadoop-test-0.20.205.jar -> /home/hadoop/.versions/0.20.205/share/hadoop/hadoop-test-0.20.205.jar
lrwxrwxrwx 1 hadoop hadoop 69 Feb 5 12:00 hadoop-core-0.20.205.jar -> /home/hadoop/.versions/0.20.205/share/hadoop/hadoop-core-0.20.205.jar
lrwxrwxrwx 1 hadoop hadoop 70 Feb 5 12:00 hadoop-tools-0.20.205.jar -> /home/hadoop/.versions/0.20.205/share/hadoop/hadoop-tools-0.20.205.jar
lrwxrwxrwx 1 hadoop hadoop 68 Feb 5 12:00 hadoop-ant-0.20.205.jar -> /home/hadoop/.versions/0.20.205/share/hadoop/hadoop-ant-0.20.205.jar

  • As a direct consequence, I had to re-code/restructure my MapReduce program because of contrib modules missing from the older version running on EMR.

  • You do not have as much opportunity to use non-MapReduce algorithms as you would with an up-to-date version of M/R.

  • Running your own cluster on EC2 gives you the flexibility to mix and match versions of the Hadoop ecosystem.


Well, administering/monitoring/maintaining a cluster isn't a small task in itself. With EMR you can get machines configured and up and running, with your custom bootstrap code, in no time. Apart from all that, EMR provides a lot of other tools/options/facilities too.

Here you don't have to worry about terminating a cluster after the jobs are done. You could certainly implement this yourself in an EC2+Hadoop setup, but EMR does it for you in a neat way.
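As a rough, present-day illustration only (the cluster name, release label, instance types and S3 paths below are placeholders, not anything from this answer), here is a boto3 sketch of launching a cluster that terminates itself once its steps finish, by leaving KeepJobFlowAliveWhenNoSteps set to False:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="auto-terminating-example",            # placeholder cluster name
    ReleaseLabel="emr-6.9.0",                   # placeholder release label
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,   # cluster shuts down after the last step
    },
    Steps=[{
        "Name": "example-step",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/job.py"],   # placeholder job
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster", response["JobFlowId"])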

You also have the facility to resize the cluster even while your jobs are running!
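For example, a small boto3 sketch of resizing a live cluster (the cluster id and target count are placeholders): it looks up the TASK instance group and asks EMR to scale it while jobs keep running.

import boto3

emr = boto3.client("emr", region_name="us-east-1")
cluster_id = "j-XXXXXXXXXXXXX"   # placeholder cluster id

# Find the TASK instance group of the running cluster.
groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
task_group = next(g for g in groups if g["InstanceGroupType"] == "TASK")

# Ask EMR to scale that group to 10 instances.
emr.modify_instance_groups(
    ClusterId=cluster_id,
    InstanceGroups=[{"InstanceGroupId": task_group["Id"], "InstanceCount": 10}],
)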

The Pig and Hive versions that ship with EMR also contain patches which make it easier to work with files in S3.

Even in this answer you may find that EMR has been given the upper hand.


I answered another related question, so it might help to add this here as well.

Someone mentioned in a comment (and it is in fact a common impression) that EMR adds some cost on top of the EC2 nodes (the underlying master/compute nodes for Spark) and provides just the cluster, which isn't the case.

But what Elastic MapReduce focuses on is the elastic and scalability part, i.e. providing scalability for your jobs, where scalability is not just the number of nodes in the cluster but different parameters such as:

  1. Dynamically resizing the cluster while jobs are running.

  2. Reduced and optimized spin-up time, efficient step resubmission, and options like automatic termination on step completion.

  3. Configuration, management and update time. As a small example, release labels automatically handle Spark/Hadoop/other-application versions, giving you an easy way to upgrade, which you would have to do manually on EC2.

  4. Ecosystem availability. The EMR ecosystem is growing. It may not matter when you start, but when your requirements grow (for example when you start to integrate other systems, say stream processing with Flink), it is much easier to simply select Flink, Pig, Hive and many more at launch time, should you need them in the future.

  5. There are libraries in the AWS SDKs, like boto3 in Python, that help you submit steps, poll for completion, etc., which are very helpful when you need to scale (see the sketch after this list). Also, EMR integrates with orchestration frameworks like Airflow, where you can sense the state, resubmit, and spin up a cluster within the pipeline with one command.

  6. Expanding on the previous point, EMR Notebooks, for example, give you a quick and interactive way to submit Spark jobs from a Jupyter notebook and see the results and job progress immediately, which can boost your productivity.

  7. This point is most important from my experience, Sometimes, scaling up the jobs with more nodes save you more money then long running jobs with low number of nodes. Because the adding node cost sometime cost you low than the normalized hours you will be spending with ec2 or small emr cluster. Just to share my experience, we had a job that used to run for 3 days, we satrted to run it with bigger EMR cluster that reduced it to 6-8 hours and it still was in the same cost and was infact a bit less.