What is the difference between Rack-local map tasks and Data-local map tasks?

hadoop mapreduce hadoop-streaming

In a data-local task, nothing needs to be copied over the network, because the block is physically stored on the same server that runs the computation.

The next tier is the rack-local task. Here the data must be copied, because no local copy of the desired block is available on the node that runs the task. Note that a rack-local copy only travels through the rack's own switch; it never leaves the rack.

There is also the worst case, where the data is available neither locally nor on the same rack. The block must then be copied over two switches to the host where the computation runs. I don't know if there is a counter for that, but it must be #all tasks - #data-local tasks - #rack-local tasks.
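To make the three tiers concrete, here is a minimal sketch in plain Python. It is not Hadoop code: the host names, the `RACK_OF` map, and the functions `locality` and `off_rack_count` are all hypothetical and only illustrate the classification and the subtraction described above.

```python
from collections import Counter

# Hypothetical topology (an assumption for illustration): host -> rack.
RACK_OF = {
    "node1": "rackA",
    "node2": "rackA",
    "node3": "rackB",
}


def locality(task_host, replica_hosts, rack_of=RACK_OF):
    """Classify a map task by how far its input block has to travel."""
    if task_host in replica_hosts:
        return "data-local"   # block is on the same server, no copy
    replica_racks = {rack_of[host] for host in replica_hosts}
    if rack_of[task_host] in replica_racks:
        return "rack-local"   # copy stays behind the rack's switch
    return "off-rack"         # copy has to cross racks


def off_rack_count(total, data_local, rack_local):
    """#all tasks - #data-local tasks - #rack-local tasks."""
    return total - data_local - rack_local


# (host running the task, hosts that store a replica of its block)
tasks = [
    ("node1", ["node1", "node3"]),  # replica on the same host
    ("node2", ["node1"]),           # replica on another host in rackA
    ("node3", ["node1", "node2"]),  # replicas only in the other rack
]

levels = Counter(locality(host, replicas) for host, replicas in tasks)
remaining = off_rack_count(len(tasks), levels["data-local"], levels["rack-local"])
```

With the job's real counter values plugged into `off_rack_count`, the result is the number of map tasks that had to read their input from another rack.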


I would point out that providing a gigabit (or faster) network between computers within the same rack is much cheaper than providing it for a larger number of computers.
The root cause is that Ethernet switches do not scale well: a switch with hundreds of ports is not available at a reasonable price.
Because of this, Hadoop tries to run a task at least in the same rack when it cannot run it on the node where the data is stored.
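The preference order can be sketched as a small ranking function. This is only an illustration in plain Python, not Hadoop's actual scheduler: `pick_node`, the host names, and the `rack_of` map are hypothetical.

```python
def pick_node(free_nodes, replica_hosts, rack_of):
    """Pick the free node closest to the data: same host, then same rack,
    then anything else. Returns None when no node is free."""
    if not free_nodes:
        return None
    replica_racks = {rack_of[host] for host in replica_hosts}

    def rank(node):
        if node in replica_hosts:
            return 0  # data-local: no network copy
        if rack_of[node] in replica_racks:
            return 1  # rack-local: copy stays inside the rack
        return 2      # off-rack: copy crosses racks

    # min() keeps the first node among equally ranked candidates.
    return min(free_nodes, key=rank)


# Hypothetical cluster layout, an assumption for this example.
rack_of = {"node1": "rackA", "node2": "rackA", "node3": "rackB"}

# The block lives on node1, which is busy; node2 shares its rack.
choice = pick_node(["node3", "node2"], ["node1"], rack_of)
```

Here `choice` is `node2`: the node holding the data is not free, so the sketch falls back to a node in the same rack rather than the one in `rackB`.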

