How AWS EMR resize How AWS EMR resize hadoop hadoop

How AWS EMR resize


The master and slaves files are only used by the shell scripts like start-all.sh, start-dfs.sh etc. These files are not used by any other function in hadoop. From the hadoop cluster perspective, the location where the namenode, secondary namenode, worker nodes are not defined by these files. EMR is not using those shell scripts for starting the cluster. The property fs.default.name or fs.defaultFS in the core-site.xml defines the namenode host. All the datanodes that starts with this configuration will report to the namenode and gets added to the cluster. Similarly, the resourcemanager host is defined in the yarn-site.xml of all the nodes.

We don't need to restart any process in the cluster for adding new nodes. Once the datanode is up, it will report to the namenode and in this way the node will contribute to the HDFS. Similarly once the nodemanager is up, it will report to the resourcemanager of the cluster and it will contribute to the processing layer.

In EMR we have 3 types of nodes.

  • Master node
  • Core node
  • Task node

For an EMR cluster the master nodes will be only one. This node is the node which has namenode and all the master services such as Resourcemanager, HBase Master etc.

Core node is the node which has storage as well as processing capability which means it has datanode and nodemanager. We can increase the number of core nodes, but we can't decrease the number because it will result in data loss.

Task nodes are the nodes which has only processing capability. This is basically for serving transient loads. This has only nodemanager. No datanode is associated with this node. We can increase or decrease the number of task nodes.

While resizing the cluster, the existing cluster is not getting disturbed. The scripts like start-all.sh, stop-all.sh are not invoked in EMR. It starts individual services and brings up the cluster. So the entries in the master and slaves files are not considered.


You can expand/contract the cluster through the AWS console. enter image description here

Use the resize option to change the size of your cluster.

You can also add "task" nodes to the cluster as well through the console.


Using Autoscaling on AWS EMR, you can scale out and scale in nodes on a cluster. Scale out action can be triggered using Cloudwatch metrics(YARNMemoryAvailablePercentage and ContainerPendingRatio). Sample Policy below

    "AutoScalingPolicy":{ "Constraints":  {   "MinCapacity": 10,   "MaxCapacity": 50  }, "Rules": [  {"Name": "Compute-scale-up",   "Description": "Scale out based on ContainerPending Mterics",   "Action":    {     "SimpleScalingPolicyConfiguration":      {"AdjustmentType": "CHANGE_IN_CAPACITY",       "ScalingAdjustment": 1,       "CoolDown":0}  },   "Trigger":    {"CloudWatchAlarmDefinition":      {"AlarmNamePrefix": "compute-scale-up",       "ComparisonOperator": "GREATER_THAN_OR_EQUAL",       "EvaluationPeriods": 3,       "MetricName": "ContainerPending",       "Namespace": "AWS/ElasticMapReduce",       "Period": 300,       "Statistic": "AVERAGE",       "Threshold": 10,       "Unit": "COUNT",       "Dimensions":        [          {"Key": "JobFlowId",           "Value": "${emr:cluster_id}"}        ]      }    }  },  {"Name": "Compute-scale-down",   "Description": "Scale in",   "Action":    {      "SimpleScalingPolicyConfiguration":      {"AdjustmentType": "CHANGE_IN_CAPACITY",       "ScalingAdjustment": -1,       "CoolDown":300}    },   "Trigger":    {"CloudWatchAlarmDefinition":      {"AlarmNamePrefix": "compute-scale-down",       "ComparisonOperator": "GREATER_THAN_OR_EQUAL",       "EvaluationPeriods": 3,       "MetricName": "MemoryAvailableMB",       "Namespace": "AWS/ElasticMapReduce",       "Period": 300,       "Statistic": "AVERAGE",       "Threshold": 24000,       "Unit": "COUNT",       "Dimensions":        [          {"Key": "JobFlowId",           "Value": "${emr:cluster_id}"}        ]      }    }  } ]}

You can refer this blog for more details https://aws.amazon.com/blogs/big-data/dynamically-scale-applications-on-amazon-emr-with-auto-scaling/