Saving dictionaries to file (numpy and Python 2/3 friendly)
After asking this two years ago, I started coding my own HDF5-based replacement for pickle/np.save. Ever since, it has matured into a stable package, so I thought I would finally answer and accept my own question, because it is by design exactly what I was looking for:
I recently found myself with a similar problem, for which I wrote a couple of functions for saving the contents of dicts to a group in a PyTables file, and loading them back into dicts.
They process nested dictionary and group structures recursively, and handle objects with types that are not natively supported by PyTables by pickling them and storing them as string arrays. It's not perfect, but at least things like numpy arrays will be stored efficiently. There's also a check included to avoid inadvertently loading enormous structures into memory when reading the group contents back into a dict.
```python
import tables
import cPickle
import warnings


def dict2group(f, parent, groupname, dictin, force=False, recursive=True):
    """
    Take a dict, shove it into a PyTables HDF5 file as a group. Each
    item in the dict must have a type and shape compatible with a
    PyTables Array. If 'force == True', any existing child group of the
    parent node with the same name as the new group will be
    overwritten. If 'recursive == True' (default), new groups will be
    created recursively for any items in the dict that are also dicts.
    """
    try:
        g = f.create_group(parent, groupname)
    except tables.NodeError as ne:
        if force:
            pathstr = parent._v_pathname + '/' + groupname
            f.remove_node(pathstr, recursive=True)
            g = f.create_group(parent, groupname)
        else:
            raise ne
    for key, item in dictin.iteritems():
        if isinstance(item, dict):
            if recursive:
                dict2group(f, g, key, item, recursive=True)
        else:
            if item is None:
                item = '_None'
            f.create_array(g, key, item)
    return g


def group2dict(f, g, recursive=True, warn=True,
               warn_if_bigger_than_nbytes=100E6):
    """
    Traverse a group, pull the contents of its children and return them
    as a Python dictionary, with the node names as the dictionary keys.
    If 'recursive == True' (default), we will recursively traverse
    child groups and put their children into sub-dictionaries,
    otherwise sub-groups will be skipped. Since this might potentially
    result in huge arrays being loaded into system memory, the 'warn'
    option will prompt the user to confirm before loading any
    individual array that is bigger than some threshold (default is
    100MB).
    """
    def memtest(child, threshold=warn_if_bigger_than_nbytes):
        mem = child.size_in_memory
        if mem > threshold:
            print '[!] "%s" is %iMB in size [!]' % (child._v_pathname,
                                                    mem / 1E6)
            confirm = raw_input('Load it anyway? [y/N] >>')
            if confirm.lower() == 'y':
                return True
            else:
                print 'Skipping item "%s"...' % child._v_pathname
                return False
        else:
            return True

    outdict = {}
    for child in g:
        try:
            if isinstance(child, tables.group.Group):
                if recursive:
                    item = group2dict(f, child)
                else:
                    continue
            else:
                if memtest(child):
                    item = child.read()
                    if isinstance(item, str) and item == '_None':
                        item = None
                else:
                    continue
            outdict.update({child._v_name: item})
        except tables.NoSuchNodeError:
            warnings.warn('No such node: "%s", skipping...' % repr(child))
    return outdict
```
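The pickling fallback described above, for objects PyTables cannot store natively, boils down to something like this. A minimal sketch using Python 3's `pickle` rather than the `cPickle` module imported above:

```python
import pickle

# An object with no native PyTables Array representation:
obj = {'weird': (1 + 2j, slice(0, 10))}

# Serialize it to a byte string, which *can* be stored as a string array...
blob = pickle.dumps(obj)

# ...and unpickle it again when reading the group contents back out.
restored = pickle.loads(blob)
assert restored == obj
```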
It's also worth mentioning `joblib.dump` and `joblib.load`, which tick all of your boxes apart from Python 2/3 cross-compatibility. Under the hood they use `np.save` for numpy arrays and `cPickle` for everything else.
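For illustration, a round trip with joblib might look like this (the filename and the contents of `data` are made up):

```python
import os
import tempfile

import joblib
import numpy as np

data = {'weights': np.arange(5), 'meta': {'lr': 0.1, 'note': None}}

# Dump the dict to disk; numpy arrays inside it are stored efficiently.
path = os.path.join(tempfile.mkdtemp(), 'data.joblib')
joblib.dump(data, path)

# Load it back; arrays and nested dicts come back intact.
loaded = joblib.load(path)
assert np.array_equal(loaded['weights'], data['weights'])
assert loaded['meta'] == data['meta']
```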
I tried playing with `np.memmap` for saving an array of dictionaries. Say we have the dictionary:

```python
d = {'a': 1, 'b': 2, 'c': [1, 2, 3, {'d': 4}]}
```

First I tried to save it directly to a `memmap`:

```python
f = np.memmap('stack.array', dtype=dict, mode='w+', shape=(100,))
f[0] = d
# CRASHES when reopening, since it loses the memory pointer

f = np.memmap('stack.array', dtype=object, mode='w+', shape=(100,))
f[0] = d
# CRASHES when reopening for the same reason
```

What did work was converting the dictionary to a string:

```python
f = np.memmap('stack.array', dtype='|S1000', mode='w+', shape=(100,))
f[0] = str(d)
```

This works, and afterwards you can call `eval(f[0])` to get the value back.
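When the stored string only ever contains plain literals like this one, `ast.literal_eval` is a safer way to turn it back into a dict than `eval`, since it refuses to execute arbitrary expressions. A sketch of the full round trip (the temporary file path is illustrative):

```python
import ast
import os
import tempfile

import numpy as np

d = {'a': 1, 'b': 2, 'c': [1, 2, 3, {'d': 4}]}

# Write the dict's string form into a fixed-width string memmap.
path = os.path.join(tempfile.mkdtemp(), 'stack.array')
f = np.memmap(path, dtype='|S1000', mode='w+', shape=(100,))
f[0] = str(d)
f.flush()

# Reopen the file and parse the stored string back into a dict.
g = np.memmap(path, dtype='|S1000', mode='r', shape=(100,))
restored = ast.literal_eval(g[0].decode())
assert restored == d
```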
I do not know what advantages this approach has over the others, but it deserves a closer look.