Retrieving subfolder names in an S3 bucket with boto3



The piece of code below returns ONLY the 'subfolders' in a 'folder' of an S3 bucket.

import boto3

bucket = 'my-bucket'
# Make sure you provide '/' at the end
prefix = 'prefix-name-with-slash/'

client = boto3.client('s3')
result = client.list_objects(Bucket=bucket, Prefix=prefix, Delimiter='/')
for o in result.get('CommonPrefixes'):
    print('sub folder : ', o.get('Prefix'))

For more details, you can refer to https://github.com/boto/boto3/issues/134


S3 is an object store; it doesn't have a real directory structure. The "/" is rather cosmetic. One reason people want a directory structure is that they can maintain/prune/add a tree in the application. For S3, you treat such a structure as a sort of index or search tag.
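As a quick illustration (a minimal sketch, with a hypothetical bucket name): no "folder" has to exist before you write a key that contains '/', and the apparent directory only shows up when you ask for it with Prefix and Delimiter.

import boto3

s3 = boto3.client('s3')

# The key is just a string that happens to contain '/'; nothing is "created" first.
s3.put_object(Bucket='my-bucket', Key='reports/2016/summary.txt', Body=b'hello')

# Listing with Prefix/Delimiter is what makes 'reports/2016/' look like a directory.
resp = s3.list_objects_v2(Bucket='my-bucket', Prefix='reports/', Delimiter='/')
print([p['Prefix'] for p in resp.get('CommonPrefixes', [])])   # ['reports/2016/']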

To manipulate objects in S3, you need boto3.client or boto3.resource, e.g. to list all objects:

import boto3

s3 = boto3.client("s3")
all_objects = s3.list_objects(Bucket='bucket-name')

http://boto3.readthedocs.org/en/latest/reference/services/s3.html#S3.Client.list_objects

In fact, if the S3 object names are stored using a '/' separator, the more recent version of list_objects (list_objects_v2) allows you to limit the response to keys that begin with the specified prefix.

To limit the results to items under certain sub-folders:

import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket=BUCKET,
    Prefix='DIR1/DIR2',
    MaxKeys=100)

Documentation
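Note that with MaxKeys=100 (or the 1000-key ceiling per call) the response may be truncated. A minimal sketch of paging through the rest, reusing the hypothetical BUCKET and 'DIR1/DIR2' prefix from above:

import boto3

BUCKET = 'my-bucket'
s3 = boto3.client("s3")

# When IsTruncated is set, pass NextContinuationToken back in to get the next page.
kwargs = dict(Bucket=BUCKET, Prefix='DIR1/DIR2', MaxKeys=100)
while True:
    response = s3.list_objects_v2(**kwargs)
    for obj in response.get('Contents', []):
        print(obj['Key'])
    if not response.get('IsTruncated'):
        break
    kwargs['ContinuationToken'] = response['NextContinuationToken']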

Another option is to use Python's os.path functions to extract the folder prefix. The problem is that this will require listing objects from undesired directories.

import os

s3_key = 'first-level/1456753904534/part-00014'
filename = os.path.basename(s3_key)
foldername = os.path.dirname(s3_key)

# if you are not using the conventional delimiter, e.g. '#'
s3_key = 'first-level#1456753904534#part-00014'
filename = s3_key.split("#")[-1]

A reminder about boto3: boto3.resource is a nice high-level API. There are pros and cons to using boto3.client vs boto3.resource. If you develop an internal shared library, using boto3.resource will give you a black-box layer over the resources used.
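As a rough comparison of the two (a minimal sketch; the bucket name is hypothetical and the 'first-level/' prefix is borrowed from the example above):

import boto3

# Low-level client: mirrors the S3 REST API and returns plain dicts.
client = boto3.client('s3')
resp = client.list_objects_v2(Bucket='my-bucket', Prefix='first-level/')

# High-level resource: object-oriented wrapper that paginates lazily for you.
bucket = boto3.resource('s3').Bucket('my-bucket')
for obj in bucket.objects.filter(Prefix='first-level/'):
    print(obj.key)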


Short answer:

  • Use Delimiter='/'. This avoids doing a recursive listing of your bucket. Some answers here wrongly suggest doing a full listing and using some string manipulation to retrieve the directory names. This could be horribly inefficient. Remember that S3 has virtually no limit on the number of objects a bucket can contain. So, imagine that, between bar/ and foo/, you have a trillion objects: you would wait a very long time to get ['bar/', 'foo/'].

  • Use Paginators. For the same reason (S3 is an engineer's approximation of infinity), you must list through pages and avoid storing all the listing in memory. Instead, consider your "lister" as an iterator, and handle the stream it produces.

  • Use boto3.client, not boto3.resource. The resource version doesn't seem to handle the Delimiter option well. If you have a resource, say bucket = boto3.resource('s3').Bucket(name), you can get the corresponding client with bucket.meta.client (see the sketch after this list).
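Putting the three points above together, a minimal sketch (bucket name and prefix are hypothetical):

import boto3

bucket = boto3.resource('s3').Bucket('my-bucket')
client = bucket.meta.client          # get the client back from the resource

# Paginate with Delimiter='/' so only the immediate "sub-directories" come back.
paginator = client.get_paginator('list_objects')
for page in paginator.paginate(Bucket=bucket.name, Prefix='some/dir/', Delimiter='/'):
    for cp in page.get('CommonPrefixes', []):
        print(cp['Prefix'])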

Long answer:

The following is an iterator that I use for simple buckets (no version handling).

import os
import boto3
from collections import namedtuple
from operator import attrgetter

S3Obj = namedtuple('S3Obj', ['key', 'mtime', 'size', 'ETag'])

def s3list(bucket, path, start=None, end=None, recursive=True, list_dirs=True,
           list_objs=True, limit=None):
    """
    Iterator that lists a bucket's objects under path, (optionally) starting with
    start and ending before end.

    If recursive is False, then list only the "depth=0" items (dirs and objects).

    If recursive is True, then list recursively all objects (no dirs).

    Args:
        bucket:
            a boto3.resource('s3').Bucket().
        path:
            a directory in the bucket.
        start:
            optional: start key, inclusive (may be a relative path under path, or
            absolute in the bucket)
        end:
            optional: stop key, exclusive (may be a relative path under path, or
            absolute in the bucket)
        recursive:
            optional, default True. If True, lists only objects. If False, lists
            only depth 0 "directories" and objects.
        list_dirs:
            optional, default True. Has no effect in recursive listing. On
            non-recursive listing, if False, then directories are omitted.
        list_objs:
            optional, default True. If False, then directories are omitted.
        limit:
            optional. If specified, then lists at most this many items.

    Returns:
        an iterator of S3Obj.

    Examples:
        # set up
        >>> s3 = boto3.resource('s3')
        ... bucket = s3.Bucket('bucket-name')

        # iterate through all S3 objects under some dir
        >>> for p in s3list(bucket, 'some/dir'):
        ...     print(p)

        # iterate through up to 20 S3 objects under some dir, starting with foo_0010
        >>> for p in s3list(bucket, 'some/dir', limit=20, start='foo_0010'):
        ...     print(p)

        # non-recursive listing under some dir:
        >>> for p in s3list(bucket, 'some/dir', recursive=False):
        ...     print(p)

        # non-recursive listing under some dir, listing only dirs:
        >>> for p in s3list(bucket, 'some/dir', recursive=False, list_objs=False):
        ...     print(p)
    """
    kwargs = dict()
    if start is not None:
        if not start.startswith(path):
            start = os.path.join(path, start)
        # note: need to use a string just smaller than start, because
        # the list_object API specifies that start is excluded (the first
        # result is *after* start).
        kwargs.update(Marker=__prev_str(start))
    if end is not None:
        if not end.startswith(path):
            end = os.path.join(path, end)
    if not recursive:
        kwargs.update(Delimiter='/')
        if not path.endswith('/'):
            path += '/'
    kwargs.update(Prefix=path)
    if limit is not None:
        kwargs.update(PaginationConfig={'MaxItems': limit})

    paginator = bucket.meta.client.get_paginator('list_objects')
    for resp in paginator.paginate(Bucket=bucket.name, **kwargs):
        q = []
        if 'CommonPrefixes' in resp and list_dirs:
            q = [S3Obj(f['Prefix'], None, None, None) for f in resp['CommonPrefixes']]
        if 'Contents' in resp and list_objs:
            q += [S3Obj(f['Key'], f['LastModified'], f['Size'], f['ETag']) for f in resp['Contents']]
        # note: even with sorted lists, it is faster to sort(a+b)
        # than heapq.merge(a, b) at least up to 10K elements in each list
        q = sorted(q, key=attrgetter('key'))
        if limit is not None:
            q = q[:limit]
            limit -= len(q)
        for p in q:
            if end is not None and p.key >= end:
                return
            yield p

def __prev_str(s):
    if len(s) == 0:
        return s
    s, c = s[:-1], ord(s[-1])
    if c > 0:
        s += chr(c - 1)
    s += ''.join(['\u7FFF' for _ in range(10)])
    return s

Test:

The following is helpful to test the behavior of the paginator and list_objects. It creates a number of dirs and files. Since pages contain up to 1000 entries, we use a multiple of that for dirs and files. dirs contains only directories (each holding one object). mixed contains a mix of dirs and objects, with a ratio of 2 objects per dir (plus one object under each dir, of course; S3 stores only objects).

import os
import concurrent.futures

def genkeys(top='tmp/test', n=2000):
    for k in range(n):
        if k % 100 == 0:
            print(k)
        for name in [
            os.path.join(top, 'dirs', f'{k:04d}_dir', 'foo'),
            os.path.join(top, 'mixed', f'{k:04d}_dir', 'foo'),
            os.path.join(top, 'mixed', f'{k:04d}_foo_a'),
            os.path.join(top, 'mixed', f'{k:04d}_foo_b'),
        ]:
            yield name

with concurrent.futures.ThreadPoolExecutor(max_workers=32) as executor:
    executor.map(lambda name: bucket.put_object(Key=name, Body='hi\n'.encode()), genkeys())

The resulting structure is:

./dirs/0000_dir/foo
./dirs/0001_dir/foo
./dirs/0002_dir/foo
...
./dirs/1999_dir/foo
./mixed/0000_dir/foo
./mixed/0000_foo_a
./mixed/0000_foo_b
./mixed/0001_dir/foo
./mixed/0001_foo_a
./mixed/0001_foo_b
./mixed/0002_dir/foo
./mixed/0002_foo_a
./mixed/0002_foo_b
...
./mixed/1999_dir/foo
./mixed/1999_foo_a
./mixed/1999_foo_b

With a little bit of doctoring of the code given above for s3list to inspect the responses from the paginator, you can observe some fun facts (a sketch of such doctoring follows the list):

  • The Marker is really exclusive. Giving Marker=topdir + 'mixed/0500_foo_a' will make the listing start after that key (as per the Amazon S3 API), i.e., with .../mixed/0500_foo_b. That's the reason for __prev_str().

  • Using Delimiter, when listing mixed/, each response from the paginator contains 666 keys and 334 common prefixes. It's pretty good at not building enormous responses.

  • By contrast, when listing dirs/, each response from the paginator contains 1000 common prefixes (and no keys).

  • Passing a limit in the form of PaginationConfig={'MaxItems': limit} limits only the number of keys, not the common prefixes. We deal with that by further truncating the stream of our iterator.
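A minimal sketch of that doctoring, assuming the test layout above (the bucket name is hypothetical): count, per page, how many keys and common prefixes the paginator returns.

import boto3

bucket = boto3.resource('s3').Bucket('my-bucket')
paginator = bucket.meta.client.get_paginator('list_objects')

# For 'tmp/test/mixed/' you should see roughly 666 keys and 334 common prefixes
# per page; for 'tmp/test/dirs/' you should see 1000 common prefixes and no keys.
for i, resp in enumerate(paginator.paginate(Bucket=bucket.name,
                                            Prefix='tmp/test/mixed/',
                                            Delimiter='/')):
    print(i,
          'keys:', len(resp.get('Contents', [])),
          'common prefixes:', len(resp.get('CommonPrefixes', [])))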