Apache Helix vs YARN


While Helix and YARN both provide capabilities to manage distributed applications, there are important differences between the two.

YARN primarily provides resource management across a cluster of machines, and requires each application to write its own logic to negotiate resources from the resource manager. Helix, on the other hand, provides a way to manage the state of a distributed application declaratively, which frees the application from implementing that logic itself. At this time, Helix does not provide resource management capabilities in the same way as YARN. The two systems are therefore quite complementary.
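To make "declarative" concrete, here is a minimal sketch of the general idea: the application states which task should be in which state on which container, and a controller works out the steps needed to get there. The function name, the `ONLINE`/`OFFLINE` states, and the data layout are assumptions made for illustration; this is not Helix's actual API.

```python
# Hypothetical model of declarative state management; not Helix's API.

def transitions(desired, current):
    """Return (task, container, from_state, to_state) steps that would make
    `current` match `desired`. A missing entry is treated as OFFLINE."""
    steps = []
    for key in sorted(set(desired) | set(current)):
        have = current.get(key, "OFFLINE")
        want = desired.get(key, "OFFLINE")
        if have != want:
            task, container = key
            steps.append((task, container, have, want))
    return steps


# The application only declares the desired placement and state.
desired = {("task-0", "c0"): "ONLINE", ("task-1", "c1"): "ONLINE"}
# What is actually running right now (task-1 is on the wrong container).
current = {("task-0", "c0"): "ONLINE", ("task-1", "c2"): "ONLINE"}

steps = transitions(desired, current)
for step in steps:
    print(step)
```

The application code never says how to move `task-1`; it only declares where it should be, and the difference between the two maps yields the required transitions.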

As an illustration, assume you have a set of nodes and you want to start some containers on them. This involves three steps:

  1. Allocate containers among nodes based on resource utilization.
  2. Start the containers.
  3. Monitor the containers and restart any that die.
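These steps can be pictured with a toy model. The sketch below assumes that "utilization" is simply a per-node count of running containers, and every name in it (`allocate`, `monitor`, the node and container ids) is hypothetical. It does not use or imitate any real YARN API.

```python
# Toy model only: these names are hypothetical and are not YARN APIs.

def allocate(node_load, num_containers):
    """Place each container on the currently least-loaded node."""
    load = dict(node_load)  # copy so the caller's dict is left untouched
    placement = {}
    for i in range(num_containers):
        # Ties are broken by node name because we scan nodes in sorted order.
        node = min(sorted(load), key=load.get)
        placement[f"container-{i}"] = node
        load[node] += 1
    return placement


def monitor(placement, alive):
    """Return the containers that are no longer alive and need a restart."""
    return sorted(c for c in placement if c not in alive)


# Step 1 and 2: decide where containers go (starting them is left abstract).
placement = allocate({"node-a": 2, "node-b": 0, "node-c": 1}, 3)
# Step 3: compare against the containers that still report as alive.
to_restart = monitor(placement, alive={"container-0", "container-2"})
print(placement, to_restart)
```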

YARN provides the framework and machinery for these three steps. Once you have the containers, you still have to implement the following features yourself:

  1. Partitioning and replication: You need to distribute tasks to containers, possibly allocating multiple tasks to each container. For redundancy, you might choose to allocate a task to multiple containers.
  2. State management: You need to manage the state of each task.
  3. Fault tolerance: When a container fails, you might choose either to redistribute its work among the remaining containers or to restart the container, depending on SLA requirements.
  4. Cluster expansion: You might start new containers to handle the workload, and then want the tasks to be redistributed.
  5. Throttling: During all these operations, you might want to limit some operations, such as data movement.
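To give a feel for how much logic these features involve, here is a minimal sketch covering replication, redistribution after a failure or an expansion, and throttling. It assumes a simple round-robin placement and a fixed cap on moves per round. The functions `assign` and `plan_moves` are invented for this example and do not reflect how Helix itself rebalances.

```python
# Illustrative sketch only; this is not Helix's rebalancer or API.

def assign(tasks, containers, replicas=1):
    """Map each task to `replicas` distinct containers, round-robin."""
    if replicas > len(containers):
        raise ValueError("not enough containers for the requested replicas")
    ordered = sorted(containers)
    return {
        task: [ordered[(i + r) % len(ordered)] for r in range(replicas)]
        for i, task in enumerate(sorted(tasks))
    }


def plan_moves(old, new, max_moves):
    """List the (task, container) placements to add, capped at `max_moves`
    per round so that data movement is throttled."""
    moves = [
        (task, c)
        for task in sorted(new)
        for c in new[task]
        if c not in old.get(task, [])
    ]
    return moves[:max_moves]


tasks = ["t0", "t1", "t2", "t3"]
before = assign(tasks, ["c0", "c1", "c2"], replicas=2)
after_failure = assign(tasks, ["c0", "c2"], replicas=2)  # c1 has failed
after_expansion = assign(tasks, ["c0", "c1", "c2", "c3"], replicas=2)

# Only two placements are allowed to start moving data in this round.
first_round = plan_moves(before, after_failure, max_moves=2)
print(first_round)
```

Naive round-robin reshuffles more placements than necessary when the container set changes. A production system would try to minimize movement, which is part of what makes this logic tedious to write by hand.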

Helix makes it easy to achieve the features listed above. In YARN, you need to write an application master to achieve them (an example of such an implementation is the application master for Hadoop MapReduce jobs).

Helix was developed at LinkedIn to manage distributed data systems in the online/nearline space. In this space, once a container is launched it runs forever, until it crashes. When a container fails, its tasks might be redistributed among the remaining containers.

YARN comes with resource scheduling algorithms that allow flexible and efficient utilization of the available hardware for short-lived tasks such as MapReduce jobs.