
EMR notebooks install additional libraries


What is the canonical way of installing additional libraries for notebooks created through the EMR interface?

EMR Notebooks recently launched 'notebook-scoped libraries', which let you install additional Python libraries on your cluster from a public or private PyPI repository and use them within the notebook session.

Notebook-scoped libraries provide the following benefits:

  • You can use libraries in an EMR notebook without having to re-create the cluster or re-attach the notebook to a cluster.
  • You can isolate library dependencies of an EMR notebook to the individual notebook session. The libraries installed from within the notebook cannot interfere with other libraries on the cluster or libraries installed within other notebook sessions.

Here are more details: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-scoped-libraries.html

Technical blog: https://aws.amazon.com/blogs/big-data/install-python-libraries-on-a-running-cluster-with-emr-notebooks/
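
For example, once the notebook is attached to a running cluster, a PySpark cell can install a package scoped to just that session. A minimal sketch using the install_pypi_package API described in the links above (the pandas version pin is only illustrative):

# install from PyPI for this notebook session only
sc.install_pypi_package("pandas==0.25.1")

# list the packages visible to this session
sc.list_packages()

import pandas as pd
print(pd.__version__)

# optional: remove the package before the session ends
sc.uninstall_package("pandas")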


What I usually do in this case is delete my cluster and create a new one with bootstrap actions. Bootstrap actions allow you to install additional libraries on your cluster: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html. For example, writing the following script and saving it in S3 will allow you to use datadog from your notebook running on top of your cluster (at least it works with EMR 5.19):

#!/bin/bash -xe
# install datadog module for use in pyspark
sudo pip-3.4 install -U datadog
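
The script then has to live in S3 so EMR can fetch it when the cluster boots. A minimal sketch, assuming the script was saved locally as install_datadog.sh (the file name is illustrative) and reusing the placeholder path from the launch command below:

# upload the bootstrap script; <path-to-bootstrap-in-aws> is the same placeholder used below
aws s3 cp install_datadog.sh s3://<path-to-bootstrap-in-aws>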

Here is the command line I would run for launching this cluster :

aws emr create-cluster --release-label emr-5.19.0 \
  --name 'EMR 5.19 test' \
  --applications Name=Hadoop Name=Spark Name=Hive Name=Livy \
  --use-default-roles \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large \
  --region eu-west-1 \
  --log-uri s3://<path-to-logs> \
  --configurations file://config-emr.json \
  --bootstrap-actions Path=s3://<path-to-bootstrap-in-aws>,Name=InstallPythonModules

And here is the config-emr.json, stored locally on your computer:

[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  },
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3"
        }
      }
    ]
  }
]

I assume that you could do exactly the same thing when creating a cluster through the EMR interface, by going to the advanced options during creation.


For the sake of an example, let's assume you need the librosa Python module on a running EMR cluster. We're going to use Python 2.7, as the procedure is simpler: Python 2.7 is guaranteed to be on the cluster, and it's the default runtime for EMR.

Create a script that installs the package:

#!/bin/bash
sudo easy_install-2.7 pip
sudo /usr/local/bin/pip2 install librosa

and save it to your home directory, e.g. /home/hadoop/install_librosa.sh. Note the name; we're going to use it later.

In the next step, you're going to run this script through another script inspired by the Amazon EMR docs: emr_install.py. It uses AWS Systems Manager (SSM) to execute your script across the nodes.

import sys
import time
from boto3 import client

try:
    clusterId = sys.argv[1]
except IndexError:
    print("Syntax: emr_install.py [ClusterId]")
    sys.exit(1)

emrclient = client('emr')

# Get the list of core nodes
instances = emrclient.list_instances(ClusterId=clusterId,
                                     InstanceGroupTypes=['CORE'])['Instances']
instance_list = [x['Ec2InstanceId'] for x in instances]

# Tag the core nodes so SSM can target them
ec2client = client('ec2')
ec2client.create_tags(Resources=instance_list,
                      Tags=[{"Key": "environment", "Value": "coreNodeLibs"}])

# Run the shell script on every tagged node
ssmclient = client('ssm')
command = ssmclient.send_command(
    Targets=[{"Key": "tag:environment", "Values": ["coreNodeLibs"]}],
    DocumentName='AWS-RunShellScript',
    Parameters={"commands": ["bash /home/hadoop/install_librosa.sh"]},
    TimeoutSeconds=3600)['Command']['CommandId']

# Give the command time to run, then poll its status
time.sleep(30)
command_status = ssmclient.list_commands(
    CommandId=command)['Commands'][0]['Status']

print("Command " + command + ": " + command_status)

To run it:

python emr_install.py [cluster_id]
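
Note that send_command assumes the SSM agent is running on the cluster nodes and that your credentials allow ssm:SendCommand; otherwise the call will fail. If the command succeeded, you can verify the install from a notebook attached to the cluster. Since the script above only targeted the CORE nodes, test from code that actually runs on the executors rather than on the master. A minimal sketch in a PySpark cell:

# import librosa inside a task so the import happens on a core node,
# and return the installed version from each executor
sc.parallelize([0]).map(lambda _: __import__("librosa").__version__).collect()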