
EMR notebooks install additional libraries


What is the canonical way of installing additional libraries for notebooks created through the EMR interface?

EMR Notebooks recently launched 'notebook-scoped libraries', which let you install additional Python libraries on your cluster from a public or private PyPI repository and use them within the notebook session.

Notebook-scoped libraries provide the following benefits:

  • You can use libraries in an EMR notebook without having to re-create the cluster or re-attach the notebook to a cluster.
  • You can isolate library dependencies of an EMR notebook to the individual notebook session. The libraries installed from within the notebook cannot interfere with other libraries on the cluster or libraries installed within other notebook sessions.

Here are more details: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-scoped-libraries.html

Technical blog: https://aws.amazon.com/blogs/big-data/install-python-libraries-on-a-running-cluster-with-emr-notebooks/
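
For example, once the notebook is attached to a running cluster, a PySpark cell can install a package scoped to just that session. A minimal sketch using the install_pypi_package API described in the links above (the pandas version pin is only illustrative):

# install from PyPI for this notebook session only
sc.install_pypi_package("pandas==0.25.1")

# list the packages visible to this session
sc.list_packages()

import pandas as pd
print(pd.__version__)

# optional: remove the package before the session ends
sc.uninstall_package("pandas")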


What I usually do in this case is delete my cluster and create a new one with bootstrap actions. Bootstrap actions allow you to install additional libraries on your cluster: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html. For example, writing the following script and saving it in S3 will allow you to use datadog from your notebook running on top of your cluster (at least it works with EMR 5.19):

#!/bin/bash -xe
# install datadog module for use in pyspark
sudo pip-3.4 install -U datadog
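
The script then has to live in S3 so EMR can fetch it when the cluster boots. A minimal sketch, assuming the script was saved locally as install_datadog.sh (the file name is illustrative) and reusing the placeholder path from the launch command below:

# upload the bootstrap script; <path-to-bootstrap-in-aws> is the same placeholder used below
aws s3 cp install_datadog.sh s3://<path-to-bootstrap-in-aws>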

Here is the command line I would run for launching this cluster :

aws emr create-cluster --release-label emr-5.19.0 \
  --name 'EMR 5.19 test' \
  --applications Name=Hadoop Name=Spark Name=Hive Name=Livy \
  --use-default-roles \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large \
  --region eu-west-1 \
  --log-uri s3://<path-to-logs> \
  --configurations file://config-emr.json \
  --bootstrap-actions Path=s3://<path-to-bootstrap-in-aws>,Name=InstallPythonModules

And here is the config-emr.json, stored locally on your computer:

[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  },
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3"
        }
      }
    ]
  }
]

I assume that you could do exactly the same thing when creating a cluster through the EMR interface, by going to the advanced options during creation.


For the sake of an example, let's assume you need the librosa Python module on a running EMR cluster. We're going to use Python 2.7, as the procedure is simpler: Python 2.7 is guaranteed to be on the cluster, and it's the default runtime for EMR.

Create a script that installs the package:

#!/bin/bash
sudo easy_install-2.7 pip
sudo /usr/local/bin/pip2 install librosa

and save it to your home directory, e.g. /home/hadoop/install_librosa.sh. Note the name; we're going to use it later.

In the next step, you're going to run this script through another script inspired by the Amazon EMR docs: emr_install.py. It uses AWS Systems Manager (SSM) to execute your script across the nodes.

import sys
import time
from boto3 import client

try:
    clusterId = sys.argv[1]
except IndexError:
    print("Syntax: emr_install.py [ClusterId]")
    sys.exit(1)

emrclient = client('emr')

# Get the list of core nodes
instances = emrclient.list_instances(ClusterId=clusterId,
                                     InstanceGroupTypes=['CORE'])['Instances']
instance_list = [x['Ec2InstanceId'] for x in instances]

# Tag the core nodes so SSM can target them
ec2client = client('ec2')
ec2client.create_tags(Resources=instance_list,
                      Tags=[{"Key": "environment", "Value": "coreNodeLibs"}])

# Run the shell script on every tagged node
ssmclient = client('ssm')
command = ssmclient.send_command(
    Targets=[{"Key": "tag:environment", "Values": ["coreNodeLibs"]}],
    DocumentName='AWS-RunShellScript',
    Parameters={"commands": ["bash /home/hadoop/install_librosa.sh"]},
    TimeoutSeconds=3600)['Command']['CommandId']

# Give the command time to run, then poll its status
time.sleep(30)
command_status = ssmclient.list_commands(
    CommandId=command)['Commands'][0]['Status']

print("Command " + command + ": " + command_status)

To run it:

python emr_install.py [cluster_id]
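
Note that send_command assumes the SSM agent is running on the cluster nodes and that your credentials allow ssm:SendCommand; otherwise the call will fail. If the command succeeded, you can verify the install from a notebook attached to the cluster. Since the script above only targeted the CORE nodes, test from code that actually runs on the executors rather than on the master. A minimal sketch in a PySpark cell:

# import librosa inside a task so the import happens on a core node,
# and return the installed version from each executor
sc.parallelize([0]).map(lambda _: __import__("librosa").__version__).collect()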