Write pandas dataframe as compressed CSV directly to Amazon s3 bucket?
Here's a solution in Python 3.5.2 using Pandas 0.20.1.
The source DataFrame can be read from S3, a local CSV, or anywhere else.
```python
import boto3
import gzip
import pandas as pd
from io import BytesIO, TextIOWrapper

df = pd.read_csv('s3://ramey/test.csv')

# Compress the CSV into an in-memory gzip buffer
gz_buffer = BytesIO()
with gzip.GzipFile(mode='w', fileobj=gz_buffer) as gz_file:
    df.to_csv(TextIOWrapper(gz_file, 'utf8'), index=False)

# Upload the compressed bytes to S3
s3_resource = boto3.resource('s3')
s3_object = s3_resource.Object('ramey', 'new-file.csv.gz')
s3_object.put(Body=gz_buffer.getvalue())
```
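To sanity-check the in-memory gzip step without AWS credentials, you can round-trip the buffer locally (a minimal sketch; the sample DataFrame is made up, everything else mirrors the code above):

```python
import gzip
import pandas as pd
from io import BytesIO, TextIOWrapper

# Hypothetical sample data for the round-trip test
df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# Compress the CSV into an in-memory gzip buffer, as in the answer above
gz_buffer = BytesIO()
with gzip.GzipFile(mode='w', fileobj=gz_buffer) as gz_file:
    df.to_csv(TextIOWrapper(gz_file, 'utf8'), index=False)

# Decompress and parse it back to confirm the bytes are a valid gzipped CSV
gz_buffer.seek(0)
restored = pd.read_csv(gz_buffer, compression='gzip')
assert restored.equals(df)
```

`gz_buffer.getvalue()` would hand those same bytes to `s3_object.put`.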
There is a more elegant solution using smart-open (https://pypi.org/project/smart-open/):
```python
import pandas as pd
from smart_open import open

# Use a context manager so the gzip stream is flushed and the
# S3 upload is completed when the block exits
with open('s3://bucket/prefix/filename.csv.gz', 'w') as fout:
    df.to_csv(fout, index=False)
```
If you want streaming writes (so the compressed and uncompressed CSV are never held fully in memory), you can do this:
```python
import s3fs
import io
import gzip

def write_df_to_s3(df, filename, path):
    s3 = s3fs.S3FileSystem(anon=False)
    with s3.open(path, 'wb') as f:
        gz = gzip.GzipFile(filename, mode='wb', compresslevel=9, fileobj=f)
        buf = io.TextIOWrapper(gz)
        df.to_csv(buf, index=False, encoding='UTF_8')
        buf.flush()  # push any text still buffered in the wrapper into gz
        gz.close()
```
TextIOWrapper is needed until this issue is fixed: https://github.com/pandas-dev/pandas/issues/19827
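As a footnote: on newer pandas versions (0.24+, where `to_csv` infers compression from the file extension, and with s3fs installed so `s3://` URLs work directly), the whole task can collapse to a one-liner like `df.to_csv('s3://bucket/key.csv.gz', index=False)` — this is an assumption about your environment, not something the answers above require. The extension-based inference can be checked locally:

```python
import os
import tempfile
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# Writing to a .csv.gz path gzips the output automatically
path = os.path.join(tempfile.mkdtemp(), 'example.csv.gz')
df.to_csv(path, index=False)

# read_csv applies the same inference on the way back in
restored = pd.read_csv(path)
assert restored.equals(df)
```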