Write pandas dataframe as compressed CSV directly to Amazon S3 bucket?


Here's a solution in Python 3.5.2 using Pandas 0.20.1.

The source DataFrame can be read from S3, a local CSV file, or anywhere else.

import boto3
import gzip
import pandas as pd
from io import BytesIO, TextIOWrapper

df = pd.read_csv('s3://ramey/test.csv')

# gzip-compress the CSV into an in-memory buffer
gz_buffer = BytesIO()
with gzip.GzipFile(mode='w', fileobj=gz_buffer) as gz_file:
    df.to_csv(TextIOWrapper(gz_file, 'utf8'), index=False)

# upload the compressed bytes to S3
s3_resource = boto3.resource('s3')
s3_object = s3_resource.Object('ramey', 'new-file.csv.gz')
s3_object.put(Body=gz_buffer.getvalue())
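To sanity-check the upload, the compressed object can be read straight back into pandas. A minimal sketch, assuming s3fs is installed and using the same bucket and key as above:

import pandas as pd

# compression is inferred from the .gz suffix
df_check = pd.read_csv('s3://ramey/new-file.csv.gz')
print(df_check.head())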


There is a more elegant solution using smart-open (https://pypi.org/project/smart-open/):

import pandas as pd
from smart_open import open

df.to_csv(open('s3://bucket/prefix/filename.csv.gz', 'w'), index=False)
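smart_open picks the gzip codec from the .gz extension, so reading the file back works the same way. A minimal sketch, assuming the same bucket and key as above:

import pandas as pd
from smart_open import open

# smart_open transparently decompresses based on the .gz extension
df = pd.read_csv(open('s3://bucket/prefix/filename.csv.gz', 'r'))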


If you want streaming writes (so the (de)compressed CSV is never held fully in memory), you can do this:

import gzip
import io

import s3fs

def write_df_to_s3(df, filename, path):
    s3 = s3fs.S3FileSystem(anon=False)
    with s3.open(path, 'wb') as f:
        # stream gzip-compressed bytes straight into the S3 object
        gz = gzip.GzipFile(filename, mode='wb', compresslevel=9, fileobj=f)
        buf = io.TextIOWrapper(gz, encoding='utf-8')
        df.to_csv(buf, index=False)
        buf.flush()  # push any buffered text into the gzip stream before closing
        gz.close()

TextIOWrapper is needed until this issue is fixed: https://github.com/pandas-dev/pandas/issues/19827
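Example call (a minimal sketch; the bucket, prefix, and file names are placeholders):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
# 'filename' only sets the name stored in the gzip header; 'path' is the S3 target
write_df_to_s3(df, 'data.csv', 'my-bucket/prefix/data.csv.gz')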