How to import a text file on AWS S3 into pandas without writing to disk


Older pandas versions use boto for read_csv, so you should be able to:

import boto
import pandas as pd

data = pd.read_csv('s3://bucket....csv')

If you need boto3 because you are on Python 3.4+, you can:

import boto3
import io
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))
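Under the hood, read_csv accepts any file-like object, which is why wrapping the response bytes in io.BytesIO works. A minimal local sketch of the same pattern, using hypothetical in-memory CSV bytes in place of obj['Body'].read():

```python
import io
import pandas as pd

# Stand-in for the bytes returned by obj['Body'].read() (hypothetical sample data)
raw = b"name,score\nalice,1\nbob,2\n"

# read_csv treats the BytesIO buffer just like a file on disk,
# so nothing is ever written to the filesystem
df = pd.read_csv(io.BytesIO(raw))
```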

Since version 0.20.1, pandas uses s3fs; see the answer below.


Now pandas can handle S3 URLs. You could simply do:

import pandas as pd
import s3fs

df = pd.read_csv('s3://bucket-name/file.csv')

You need to install s3fs if you don't have it: pip install s3fs

Authentication

If your S3 bucket is private and requires authentication, you have two options:

1- Add access credentials to your ~/.aws/credentials config file

[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

Or

2- Set the following environment variables with their proper values:

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_SESSION_TOKEN
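The environment-variable route can also be taken from Python itself before calling read_csv, since boto3 and s3fs both pick these variables up automatically. A sketch using AWS's documented example credentials (placeholders, not real keys):

```python
import os

# Placeholder credentials (AWS's documented example values) -- replace with your own
os.environ["AWS_ACCESS_KEY_ID"] = "AKIAIOSFODNN7EXAMPLE"
os.environ["AWS_SECRET_ACCESS_KEY"] = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"

# After this, pd.read_csv('s3://bucket-name/file.csv') can authenticate
# without any further configuration.
```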


This is now supported in the latest pandas. See

http://pandas.pydata.org/pandas-docs/stable/io.html#reading-remote-files

e.g.,

df = pd.read_csv('s3://pandas-test/tips.csv')