Write pandas dataframe as compressed CSV directly to Amazon S3 bucket?


Here's a solution in Python 3.5.2 using Pandas 0.20.1.

The source DataFrame can be read from S3, a local CSV file, or anywhere else.

import boto3
import gzip
import pandas as pd
from io import BytesIO, TextIOWrapper

df = pd.read_csv('s3://ramey/test.csv')

# gzip-compress the CSV into an in-memory buffer
gz_buffer = BytesIO()
with gzip.GzipFile(mode='w', fileobj=gz_buffer) as gz_file:
    df.to_csv(TextIOWrapper(gz_file, 'utf8'), index=False)

# upload the compressed bytes to S3
s3_resource = boto3.resource('s3')
s3_object = s3_resource.Object('ramey', 'new-file.csv.gz')
s3_object.put(Body=gz_buffer.getvalue())
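To sanity-check the upload, the compressed object can be read straight back into pandas. A minimal sketch, assuming s3fs is installed and using the same bucket and key as above:

import pandas as pd

# compression is inferred from the .gz suffix
df_check = pd.read_csv('s3://ramey/new-file.csv.gz')
print(df_check.head())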


There is a more elegant solution using smart-open (https://pypi.org/project/smart-open/):

import pandas as pd
from smart_open import open

df.to_csv(open('s3://bucket/prefix/filename.csv.gz', 'w'), index=False)
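smart_open picks the gzip codec from the .gz extension, so reading the file back works the same way. A minimal sketch, assuming the same bucket and key as above:

import pandas as pd
from smart_open import open

# smart_open transparently decompresses based on the .gz extension
df = pd.read_csv(open('s3://bucket/prefix/filename.csv.gz', 'r'))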


If you want streaming writes (so the (de)compressed CSV is never held fully in memory), you can do this:

import gzip
import io

import s3fs

def write_df_to_s3(df, filename, path):
    s3 = s3fs.S3FileSystem(anon=False)
    with s3.open(path, 'wb') as f:
        # stream gzip-compressed bytes straight into the S3 object
        gz = gzip.GzipFile(filename, mode='wb', compresslevel=9, fileobj=f)
        buf = io.TextIOWrapper(gz, encoding='utf-8')
        df.to_csv(buf, index=False)
        buf.flush()  # push any buffered text into the gzip stream before closing
        gz.close()

TextIOWrapper is needed until this issue is fixed: https://github.com/pandas-dev/pandas/issues/19827
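Example call (a minimal sketch; the bucket, prefix, and file names are placeholders):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
# 'filename' only sets the name stored in the gzip header; 'path' is the S3 target
write_df_to_s3(df, 'data.csv', 'my-bucket/prefix/data.csv.gz')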