
Getting a data stream from a zipped file sitting in a S3 bucket using boto3 lib and AWS Lambda


So I used BytesIO to read the compressed file into a buffer object, then used zipfile to open that buffer as an uncompressed stream, and I was able to read the data line by line.

import io
import zipfile
import boto3
import sys

s3 = boto3.resource('s3', 'us-east-1')

def stream_zip_file():
    obj = s3.Object(
        bucket_name='MonkeyBusiness',
        key='/Daily/Business/Banana/{current-date}/banana.zip'
    )
    # Read the compressed object into an in-memory buffer
    buffer = io.BytesIO(obj.get()["Body"].read())
    print(buffer)
    z = zipfile.ZipFile(buffer)
    # Open the first member of the archive as a file-like object
    foo2 = z.open(z.infolist()[0])
    print(sys.getsizeof(foo2))
    # Iterate over the decompressed stream line by line
    line_counter = 0
    for _ in foo2:
        line_counter += 1
    print(line_counter)
    z.close()

if __name__ == '__main__':
    stream_zip_file()
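If you want the lines decoded as text rather than raw bytes, one variation (a sketch only; the bucket name, key, and encoding here are assumptions, not from the code above) is to wrap the archive member in io.TextIOWrapper:

import io
import zipfile
import boto3

s3 = boto3.resource('s3', 'us-east-1')
obj = s3.Object(bucket_name='MonkeyBusiness', key='banana.zip')
buffer = io.BytesIO(obj.get()["Body"].read())
with zipfile.ZipFile(buffer) as z, z.open(z.infolist()[0]) as member:
    # TextIOWrapper decodes the binary member stream, so iteration yields str
    for line in io.TextIOWrapper(member, encoding='utf-8'):
        print(line.rstrip('\n'))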


This is not an exact answer, but you can try this out.

First, please adapt the answer that mentions reading a gzip file with limited memory; that method allows one to stream the file chunk by chunk. And boto3's S3 put_object() and upload_fileobj() seem to allow streaming.
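For reference, a minimal sketch of that chunk-by-chunk idea (the bucket, key, and process() callback are placeholders I've introduced, not part of the original answer): gzip.GzipFile can read directly from the StreamingBody that get_object returns, so only one decompressed block is held in memory at a time.

import gzip
import boto3

s3 = boto3.client('s3')
# get_object returns a StreamingBody, which gzip.GzipFile can read from directly
body = s3.get_object(Bucket='my-bucket', Key='data.gz')['Body']
with gzip.GzipFile(fileobj=body) as gz:
    while True:
        chunk = gz.read(1 << 16)  # 64 KB of *decompressed* data per iteration
        if not chunk:
            break
        process(chunk)  # hypothetical callback for each decompressed chunk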

You need to mix and adapt the above-mentioned code with the following decompression.

import io
import gzip
import boto3

s3_client = boto3.client('s3')
stream = io.BytesIO()  # Python 3 replacement for cStringIO.StringIO
stream.write(s3_data)  # s3_data: the compressed bytes already fetched from S3
stream.seek(0)
blocksize = 1 << 16  # 64 KB
with gzip.GzipFile(fileobj=stream) as decompressor:
    while True:
        chunk = decompressor.read(blocksize)
        if not chunk:
            break
        s3_client.upload_fileobj(io.BytesIO(chunk), "bucket", "key")  # expects a file-like object, not raw bytes

I cannot guarantee the above code works; it just gives you the idea of decompressing the file and re-uploading it in chunks. You might even need to pipe the decompressed data into a BytesIO buffer before passing it to upload_fileobj. There will be a lot of testing.
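One possible way to avoid that manual pipelining, if I read the boto3 docs correctly: upload_fileobj only requires an object with a read() method returning bytes, and it streams the upload in chunks itself (multipart under the hood), so the GzipFile object might be passed in directly. A sketch with placeholder bucket and key names:

import gzip
import boto3

s3_client = boto3.client('s3')
# Stream-decompress straight from the source object into the destination upload;
# upload_fileobj pulls from the file-like object chunk by chunk on its own
body = s3_client.get_object(Bucket='src-bucket', Key='data.gz')['Body']
with gzip.GzipFile(fileobj=body) as decompressor:
    s3_client.upload_fileobj(decompressor, 'dst-bucket', 'data.txt')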

If you don't need to decompress the file ASAP, my suggestion is to use Lambda to put the file reference into an SQS queue. When there are "enough" files, trigger a Spot instance (which will be pretty cheap) to read the queue and process the files.
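For that queueing idea, a rough sketch of the Lambda side (the queue URL is a made-up placeholder): the function just forwards each uploaded object's bucket and key from the S3 event to SQS, and the Spot worker polls the queue and does the actual decompression.

import json
import boto3

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/zip-files'  # placeholder

def lambda_handler(event, context):
    # Forward each uploaded object's location to the queue; a worker
    # (e.g. a Spot instance polling SQS) picks it up and processes it later
    for record in event['Records']:
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({
                'bucket': record['s3']['bucket']['name'],
                'key': record['s3']['object']['key'],
            })
        )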