Use AWS Glue Python with NumPy and Pandas Python Packages


You can check which Python packages are currently installed by running this script as a Glue job:

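    import logging
    import pip

    logger = logging.getLogger(__name__)
    logger.setLevel(logging.INFO)

    if __name__ == '__main__':
        # pip._internal is a private API and may change between pip versions.
        # pip prints the package list to stdout; main() returns the exit code.
        logger.info(pip._internal.main(['list']))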
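If you would rather not depend on pip's private API, here is a minimal sketch that gets a similar listing from the standard library. It assumes Python 3.8 or newer (for `importlib.metadata`), and `list_installed_packages` is just a name I made up for illustration, not something Glue provides.

```python
from importlib import metadata  # standard library in Python 3.8+


def list_installed_packages():
    """Return a sorted list of (name, version) pairs for installed distributions."""
    packages = set()
    for dist in metadata.distributions():
        meta = dist.metadata
        if meta is None:
            continue  # skip distributions whose metadata cannot be read
        name = meta.get("Name")
        version = meta.get("Version")
        if name and version:
            packages.add((name, version))
    # Sort case-insensitively by package name, similar to `pip list`.
    return sorted(packages, key=lambda item: (item[0].lower(), item[1]))


if __name__ == "__main__":
    for name, version in list_installed_packages():
        print(f"{name} {version}")
```

The output goes to stdout, so it should end up wherever your job's standard output is logged.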

As of 30-Jun-2020, Glue has these Python packages pre-installed, so NumPy and pandas are covered.

    awscli 1.16.242
    boto3 1.9.203
    botocore 1.12.232
    certifi 2020.4.5.1
    chardet 3.0.4
    colorama 0.3.9
    docutils 0.15.2
    idna 2.8
    jmespath 0.9.4
    numpy 1.16.2
    pandas 0.24.2
    pip 20.0.2
    pyasn1 0.4.8
    PyGreSQL 5.0.6
    python-dateutil 2.8.1
    pytz 2019.3
    PyYAML 5.2
    requests 2.22.0
    rsa 3.4.2
    s3transfer 0.2.1
    scikit-learn 0.20.3
    scipy 1.2.1
    setuptools 45.1.0
    six 1.14.0
    urllib3 1.25.8
    virtualenv 16.7.9
    wheel 0.34.2

You can install additional packages in a Glue Python shell job if they are listed in the requirements.txt used to build the attached .whl file. The .whl file is collected and installed before your script starts. I would also suggest looking into SageMaker Processing, which is easier for Python-based jobs. Unlike the serverless instance behind the Glue Python shell, you are not limited to 16 GB there.
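To make the requirements.txt-to-wheel step more concrete, here is a rough sketch of how a setup.py could feed requirements.txt into the wheel's dependency list. This is only one common way to do it and an assumption on my part, not necessarily how the wheel above was built. The package name `my_glue_deps`, the version, and both helper names are hypothetical.

```python
def read_requirements(text):
    """Return requirement specifiers from requirements.txt content.

    Blank lines, comment lines, and pip option lines (starting with "-")
    are skipped. Inline comments are not handled in this sketch.
    """
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith(("#", "-"))]


def setup_kwargs(requirements_text, name="my_glue_deps", version="0.1.0"):
    """Build keyword arguments for setuptools.setup() (names are hypothetical)."""
    return {
        "name": name,
        "version": version,
        "py_modules": [],  # no code of our own, the wheel only carries dependencies
        "install_requires": read_requirements(requirements_text),
    }


# In setup.py, the result would be passed to setuptools, for example:
#   from pathlib import Path
#   from setuptools import setup
#   setup(**setup_kwargs(Path("requirements.txt").read_text()))
```

Running `python setup.py bdist_wheel` (with the `wheel` package installed) should then produce a .whl under `dist/` that you can upload to S3 and attach to the job.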


I think the current answer is you cannot. According to AWS Glue Documentation:

Only pure Python libraries can be used. Libraries that rely on C extensions, such as the pandas Python Data Analysis Library, are not yet supported.

But even when I tried to include a library written in pure Python from S3, the Glue job failed because of an HDFS permission problem. If you find a way to solve this, please let me know as well.


If the libraries you need are not pure Python and you still want to use them, you can install them from within your Glue code with the script below:

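    import os
    import site
    from setuptools.command import easy_install

    # Install into the directory named by the GLUE_INSTALLATION environment variable.
    install_path = os.environ['GLUE_INSTALLATION']
    easy_install.main(["--install-dir", install_path, "<library-name>"])

    # Reload site so the newly installed package becomes importable.
    # On Python 3, reload is not a builtin: add "from importlib import reload".
    reload(site)

    import <installed library>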
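Since easy_install is deprecated in newer setuptools releases, the same idea can be sketched with pip instead. This swaps in `pip install --target` for easy_install, so it is a variation on the script above, not the original method. It assumes Python 3, that pip is available in the job, and that the job can reach a package index. Both function names are made up for illustration.

```python
import importlib
import subprocess
import sys


def pip_install_command(package, target_dir):
    """Build the pip command that installs `package` into `target_dir`."""
    return [sys.executable, "-m", "pip", "install", "--target", target_dir, package]


def install_and_import(package, module_name, target_dir):
    """Install `package` with pip, then import and return `module_name`."""
    subprocess.check_call(pip_install_command(package, target_dir))
    if target_dir not in sys.path:
        sys.path.insert(0, target_dir)
    # Make files created after interpreter start visible to the import system.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


# Example call, keeping the placeholder from the original script:
#   import os
#   lib = install_and_import("<library-name>", "<module name>",
#                            os.environ["GLUE_INSTALLATION"])
```

The package name and the importable module name are passed separately because they can differ (for example, the package scikit-learn is imported as sklearn).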