
Uploading a Dataframe to AWS S3 Bucket from SageMaker


One way to solve this would be to save the CSV to local storage on the SageMaker notebook instance, and then use the S3 APIs via boto3 to upload the file as an S3 object. The S3 docs for upload_file() are available here.

Note: you'll need to ensure that your SageMaker-hosted notebook instance has the proper ReadWrite permissions in its IAM role, otherwise you'll receive a permissions error.
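
If you're not sure which role that is, here is a quick sketch (assuming the sagemaker Python SDK that ships with notebook instances) to print the execution role whose policy needs those S3 permissions:

import sagemaker

# Print the IAM role attached to this notebook instance; this is the role
# that needs an S3 ReadWrite policy for the upload below to succeed.
print(sagemaker.get_execution_role())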

# code you already have, saving the file locally to whatever directory you wish
file_name = "mydata.csv"
df.to_csv(file_name)

# instantiate S3 client and upload to s3
import boto3
s3 = boto3.resource('s3')
s3.meta.client.upload_file(file_name, 'YOUR_S3_BUCKET_NAME', 'DESIRED_S3_OBJECT_NAME')

Alternatively, upload_fileobj() may help for parallelizing as a multi-part upload.
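
For illustration, a minimal sketch of that route, streaming the CSV from an in-memory buffer (the bucket and object names are placeholders, and the TransferConfig values are just example settings for the multi-part/parallel behaviour):

import io
import boto3
from boto3.s3.transfer import TransferConfig

# Serialize the dataframe to bytes and wrap it in a file-like object
csv_buffer = io.BytesIO(df.to_csv(index=False).encode('utf-8'))

# Upload via the client API; TransferConfig controls multi-part thresholds
# and the number of concurrent upload threads.
s3_client = boto3.client('s3')
s3_client.upload_fileobj(
    csv_buffer,
    'YOUR_S3_BUCKET_NAME',
    'DESIRED_S3_OBJECT_NAME',
    Config=TransferConfig(multipart_threshold=8 * 1024 * 1024, max_concurrency=4),
)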


You can use boto3 to upload a file, but given that you're working with a dataframe and pandas, you should consider dask. You can install it via conda install dask s3fs

import dask.dataframe as dd

Read from S3

df = dd.read_csv('s3://{}/{}'.format(bucket, data2read),
                 storage_options={'key': AWS_ACCESS_KEY_ID,
                                  'secret': AWS_SECRET_ACCESS_KEY})

Update

Now, if you want to use this file as a pandas dataframe, you should compute it as

df = df.compute()

Write to S3

To write back to S3, you should first load your df into dask with the number of partitions you need (it must be specified)

df = dd.from_pandas(df, npartitions=N)

And then you can upload it to S3

df.to_csv('s3://{}/{}'.format(bucket, data2write),
          storage_options={'key': AWS_ACCESS_KEY_ID,
                           'secret': AWS_SECRET_ACCESS_KEY})

Update

Although the API is similar, to_csv in pandas is not the same as the one in dask; in particular, the latter has the storage_options parameter. Furthermore, dask doesn't save to a single file. Let me explain: if you decide to write to s3://my_bucket/test.csv with dask, then instead of having a file called test.csv you are going to have a folder with that name containing N files, where N is the number of partitions we specified before.

Final Note

I understand that it could feel strange to save to multiple files, but given that dask reads all the files in a folder, once you get used to it, it can be very convenient.
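
For instance, a minimal sketch of reading that partitioned output back into one dask dataframe, assuming the s3://my_bucket/test.csv folder from the example above (the '*' glob is an assumption about whatever part-file names dask wrote there):

import dask.dataframe as dd

# Read every part file under the test.csv "folder" back into a single
# dask dataframe (credentials as in the earlier examples).
df = dd.read_csv('s3://my_bucket/test.csv/*',
                 storage_options={'key': AWS_ACCESS_KEY_ID,
                                  'secret': AWS_SECRET_ACCESS_KEY})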