
Amazon EMR - What is the need for Task nodes when we have Core nodes?


According to the AWS documentation [1]:

The node types in Amazon EMR are as follows:

Master node: A node that manages the cluster by running software components to coordinate the distribution of data and tasks among other nodes for processing. The master node tracks the status of tasks and monitors the health of the cluster. Every cluster has a master node, and it's possible to create a single-node cluster with only the master node.

Core node: A node with software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster. Multi-node clusters have at least one core node.

Task node: A node with software components that only runs tasks and does not store data in HDFS. Task nodes are optional.

According to the AWS documentation [2]:

Task nodes are optional. You can use them to add power to perform parallel computation tasks on data, such as Hadoop MapReduce tasks and Spark executors.

Task nodes don't run the Data Node daemon, nor do they store data in HDFS.

Some use cases are:

  • You can use task nodes to process data streamed from S3. In this case network I/O won't increase, since the data being processed isn't on HDFS.
  • Task nodes can be added or removed freely because no HDFS daemons run on them, so they hold no HDFS data. Core nodes do run HDFS daemons, so repeatedly adding and removing them isn't good practice.
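
To make the second point concrete, here is a minimal sketch of resizing a task instance group with boto3's EMR client and its `modify_instance_groups` call. The cluster ID and instance group ID are placeholders, and the helper names `build_task_resize_request` and `resize_task_group` are my own, not part of any AWS API. The request is built separately from the call so it can be inspected without AWS credentials.

```python
def build_task_resize_request(cluster_id, task_group_id, new_count):
    """Build keyword arguments for the EMR client's modify_instance_groups call.

    cluster_id and task_group_id are placeholders for your own cluster
    and task instance group IDs (hypothetical values in the example below).
    """
    if new_count < 0:
        raise ValueError("instance count cannot be negative")
    return {
        "ClusterId": cluster_id,
        "InstanceGroups": [
            {"InstanceGroupId": task_group_id, "InstanceCount": new_count},
        ],
    }


def resize_task_group(cluster_id, task_group_id, new_count):
    """Send the resize request; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    emr = boto3.client("emr")
    request = build_task_resize_request(cluster_id, task_group_id, new_count)
    emr.modify_instance_groups(**request)


if __name__ == "__main__":
    # Only prints the request; nothing is sent to AWS.
    print(build_task_resize_request("j-EXAMPLE", "ig-EXAMPLE", 4))
```

Because task nodes hold no HDFS blocks, shrinking this group does not require any HDFS data to be moved off the nodes first, which is the reason the answer above prefers resizing task nodes over core nodes.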

Resources:

[1] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-overview.html#emr-overview-clusters

[2] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-master-core-task-nodes.html#emr-plan-task


  • Traditional Hadoop assumes that all of your workload requires high I/O; with EMR you can choose the instance type based on your workload. For high I/O needs (for example, up to 100 Gbps), go with C-type or R-type instances, and you can use placement groups. Keep your core-to-task node ratio at 1:5 or lower; this keeps I/O optimal. If you want higher throughput, select C or R instances for both core and task nodes. (edited - explaining barely any perf loss with EMR)

  • The advantage of task nodes is that they can scale up and down faster, which can minimize compute cost. A traditional Hadoop cluster is hard to scale in either direction because the worker nodes are also part of HDFS. Task nodes are optional, since core nodes can run map and reduce tasks.

  • Core nodes take longer to scale up or down, depending on the tasks running; hence task nodes are offered as an option for quicker auto scaling.

Reference: https://aws.amazon.com/blogs/big-data/best-practices-for-resizing-and-automatic-scaling-in-amazon-emr/
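
As a rough illustration of the 1:5 guideline above, the sketch below checks a planned cluster size against it. I am reading "1:5 or lower" as "at most five task nodes per core node", which is my interpretation of the answer rather than an AWS rule; the function names are made up for this example.

```python
def max_task_nodes(core_nodes, tasks_per_core=5):
    """Upper bound on task nodes under an assumed 1:tasks_per_core ratio."""
    if core_nodes < 1:
        raise ValueError("a multi-node cluster needs at least one core node")
    return core_nodes * tasks_per_core


def within_ratio(core_nodes, task_nodes, tasks_per_core=5):
    """True if the planned task node count stays within the assumed ratio."""
    return task_nodes <= max_task_nodes(core_nodes, tasks_per_core)


if __name__ == "__main__":
    print(max_task_nodes(3))     # 3 core nodes allow up to 15 task nodes
    print(within_ratio(3, 20))   # 20 task nodes would exceed the guideline
```

A check like this could serve as a guard on the maximum capacity you give an auto scaling policy for the task group.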


One use case is running Spot instances as task nodes. If they are cheap enough, it may be worthwhile to add some compute power to your EMR cluster this way. This would be mostly for tasks that are not sensitive to interruption.
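
A minimal sketch of what that might look like with boto3's `add_instance_groups` call, assuming a cluster that uses instance groups rather than instance fleets. The instance type, count, price, group name, and cluster ID are all placeholder values, and the helper names are hypothetical.

```python
def build_spot_task_group(instance_type, instance_count,
                          bid_price=None, name="spot-task-group"):
    """Build an instance group config for a Spot-backed task group.

    bid_price, if given, is the maximum Spot price in USD as a string.
    All values passed in the example below are placeholders.
    """
    group = {
        "Name": name,
        "InstanceRole": "TASK",   # task nodes: compute only, no HDFS storage
        "Market": "SPOT",
        "InstanceType": instance_type,
        "InstanceCount": instance_count,
    }
    if bid_price is not None:
        group["BidPrice"] = bid_price
    return group


def add_spot_task_group(cluster_id, group):
    """Attach the group to a running cluster; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    emr = boto3.client("emr")
    return emr.add_instance_groups(JobFlowId=cluster_id, InstanceGroups=[group])


if __name__ == "__main__":
    # Only prints the config; nothing is sent to AWS.
    print(build_spot_task_group("m5.xlarge", 4, bid_price="0.10"))
```

Since a Spot instance can be reclaimed at any time, keeping these nodes in the task role means an interruption costs you in-flight work but no HDFS data.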