S3 and EMR data locality [closed]

amazon-web-services hadoop amazon-s3 amazon-ec2 amazon-emr

EMR does not copy data from S3 into HDFS. It uses EMRFS, its own implementation of the HDFS interface on top of S3, so jobs read and write S3 directly as if they were operating on an actual HDFS. https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html
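To make the distinction concrete, here is a minimal sketch of how the URI scheme of a path indicates which storage layer serves it on an EMR cluster. It assumes the default setup described in the EMR documentation, where `s3://` paths go through EMRFS and `hdfs://` paths go to the cluster's own HDFS. The function name `storage_backend`, the labels it returns, and the example bucket name are hypothetical and only for illustration; this is not an EMR API.

```python
from urllib.parse import urlparse

# Illustrative mapping only (an assumption, not an EMR API):
# on a default EMR cluster, s3:// is handled by EMRFS and
# hdfs:// by the HDFS running on the cluster nodes.
SCHEME_TO_BACKEND = {
    "s3": "emrfs",
    "hdfs": "hdfs",
}


def storage_backend(path: str) -> str:
    """Return a label for the storage layer implied by the path's URI scheme."""
    scheme = urlparse(path).scheme.lower()
    return SCHEME_TO_BACKEND.get(scheme, "unknown")


if __name__ == "__main__":
    # "my-bucket" is a placeholder bucket name.
    for p in ("s3://my-bucket/events/part-0000.parquet",
              "hdfs:///user/hadoop/events/part-0000.parquet",
              "/tmp/events.csv"):
        print(p, "->", storage_backend(p))
```

In other words, pointing a job at an `s3://` path does not trigger a copy into HDFS first; the data is read over the network from S3 when the task runs.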

As for data locality, S3 is RACK_LOCAL to EMR Spark clusters.

According to the sources below, EMR with S3 through EMRFS does not maintain data locality and is not well suited to SQL-style analytics processing. Redshift is the better choice for such use cases, since compute and data sit in one place. See 39:00 to 42:00 in the following video:

https://youtu.be/08G9NfDETVE

This is also mentioned in https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html. Please refer to the performance per dollar section.

To see how EMR works with S3, refer to the book Programming Elastic MapReduce by Kevin Schmidt and Christopher Phillips (Chapter 1, section "Amazon Elastic MapReduce Versus Traditional Hadoop Installs").
