
Saving dictionaries to file (numpy and Python 2/3 friendly)


After asking this two years ago, I started coding my own HDF5-based replacement for pickle/np.save. Since then it has matured into a stable package, so I thought I would finally answer and accept my own question, because it is by design exactly what I was looking for.


I recently ran into a similar problem, for which I wrote a couple of functions: one for saving the contents of a dict to a group in a PyTables file, and another for loading it back into a dict.

They process nested dictionary and group structures recursively, and handle objects with types that are not natively supported by PyTables by pickling them and storing them as string arrays. It's not perfect, but at least things like numpy arrays will be stored efficiently. There's also a check included to avoid inadvertently loading enormous structures into memory when reading the group contents back into a dict.

import pickle  # the original imported cPickle, for pickling unsupported types
import warnings

import tables


def dict2group(f, parent, groupname, dictin, force=False, recursive=True):
    """
    Take a dict and shove it into a PyTables HDF5 file as a group. Each item
    in the dict must have a type and shape compatible with a PyTables Array.

    If 'force == True', any existing child group of the parent node with the
    same name as the new group will be overwritten.

    If 'recursive == True' (default), new groups will be created recursively
    for any items in the dict that are also dicts.
    """
    try:
        g = f.create_group(parent, groupname)
    except tables.NodeError:
        if force:
            pathstr = parent._v_pathname + '/' + groupname
            f.remove_node(pathstr, recursive=True)
            g = f.create_group(parent, groupname)
        else:
            raise
    for key, item in dictin.items():
        if isinstance(item, dict):
            if recursive:
                dict2group(f, g, key, item, recursive=True)
        else:
            if item is None:
                # PyTables can't store None directly, so use a sentinel string
                item = '_None'
            f.create_array(g, key, item)
    return g


def group2dict(f, g, recursive=True, warn=True,
               warn_if_bigger_than_nbytes=100E6):
    """
    Traverse a group, pull the contents of its children and return them as
    a Python dictionary, with the node names as the dictionary keys.

    If 'recursive == True' (default), we will recursively traverse child
    groups and put their children into sub-dictionaries; otherwise sub-
    groups will be skipped.

    Since this might potentially result in huge arrays being loaded into
    system memory, the 'warn' option will prompt the user to confirm before
    loading any individual array that is bigger than some threshold (default
    is 100MB).
    """
    def memtest(child, threshold=warn_if_bigger_than_nbytes):
        mem = child.size_in_memory
        if mem > threshold:
            print('[!] "%s" is %iMB in size [!]' % (child._v_pathname, mem / 1E6))
            confirm = input('Load it anyway? [y/N] >>')
            if confirm.lower() == 'y':
                return True
            print('Skipping item "%s"...' % child._v_pathname)
            return False
        return True

    outdict = {}
    for child in g:
        try:
            if isinstance(child, tables.group.Group):
                if recursive:
                    item = group2dict(f, child)
                else:
                    continue
            else:
                if (not warn) or memtest(child):
                    item = child.read()
                    if isinstance(item, bytes):
                        # PyTables returns stored strings as bytes on Python 3
                        item = item.decode()
                    if isinstance(item, str) and item == '_None':
                        # undo the sentinel used for None values
                        item = None
                else:
                    continue
            outdict[child._v_name] = item
        except tables.NoSuchNodeError:
            warnings.warn('No such node: "%s", skipping...' % repr(child))
    return outdict
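As a quick illustration of how the two functions fit together, here is a minimal sketch; the file name and example dict are my own, not part of the functions above:

import numpy as np
import tables

data = {'x': np.arange(10), 'nested': {'y': 3.14, 'z': None}}

# write the dict into a group called 'mydict' under the file root
with tables.open_file('store.h5', mode='w') as f:
    dict2group(f, f.root, 'mydict', data)

# read it back into a plain Python dict
with tables.open_file('store.h5', mode='r') as f:
    restored = group2dict(f, f.root.mydict)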

It's also worth mentioning joblib.dump and joblib.load, which tick all of your boxes apart from Python 2/3 cross-compatibility. Under the hood they use np.save for numpy arrays and cPickle for everything else.
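A minimal sketch of that interface (the file name and dict contents here are just for illustration):

import numpy as np
import joblib

data = {'weights': np.random.rand(100, 100), 'label': 'test'}
joblib.dump(data, 'data.joblib')       # numpy arrays stored efficiently
restored = joblib.load('data.joblib')  # everything else goes through pickle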


I tried playing with np.memmap for saving an array of dictionaries. Say we have the dictionary:

d = {'a': 1, 'b': 2, 'c': [1, 2, 3, {'d': 4}]}

first I tried to directly save it to a memmap:

import numpy as np

f = np.memmap('stack.array', dtype=dict, mode='w+', shape=(100,))
f[0] = d
# CRASHES when reopening since it loses the memory pointer

f = np.memmap('stack.array', dtype=object, mode='w+', shape=(100,))
f[0] = d
# CRASHES when reopening for the same reason

The approach that worked was converting the dictionary to a string:

f = np.memmap('stack.array', dtype='|S1000', mode='w+', shape=(100,))
f[0] = str(d)

This works, and afterwards you can call eval(f[0]) to get the value back.
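A sketch of that read-back step, assuming the 'stack.array' file written above; ast.literal_eval is a safer substitute for eval when the string contains only Python literals:

import ast
import numpy as np

f = np.memmap('stack.array', dtype='|S1000', mode='r', shape=(100,))
# entries come back as bytes on Python 3, so decode before parsing
restored = ast.literal_eval(f[0].decode())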

I do not know the advantage of this approach over the others, but it deserves a closer look.