Saving dictionaries to file (numpy and Python 2/3 friendly)
After asking this two years ago, I started coding my own HDF5-based replacement for pickle/np.save. Ever since, it has matured into a stable package, so I thought I would finally answer and accept my own question, because it is by design exactly what I was looking for:
I recently found myself with a similar problem, for which I wrote a couple of functions for saving the contents of dicts to a group in a PyTables file, and loading them back into dicts.
They process nested dictionary and group structures recursively, and handle objects with types that are not natively supported by PyTables by pickling them and storing them as string arrays. It's not perfect, but at least things like numpy arrays will be stored efficiently. There's also a check included to avoid inadvertently loading enormous structures into memory when reading the group contents back into a dict.
```python
import tables
import cPickle
import warnings


def dict2group(f, parent, groupname, dictin, force=False, recursive=True):
    """
    Take a dict, shove it into a PyTables HDF5 file as a group. Each
    item in the dict must have a type and shape compatible with a
    PyTables Array. If 'force == True', any existing child group of the
    parent node with the same name as the new group will be
    overwritten. If 'recursive == True' (default), new groups will be
    created recursively for any items in the dict that are also dicts.
    """
    try:
        g = f.create_group(parent, groupname)
    except tables.NodeError as ne:
        if force:
            pathstr = parent._v_pathname + '/' + groupname
            f.remove_node(pathstr, recursive=True)
            g = f.create_group(parent, groupname)
        else:
            raise ne
    for key, item in dictin.iteritems():
        if isinstance(item, dict):
            if recursive:
                dict2group(f, g, key, item, recursive=True)
        else:
            if item is None:
                item = '_None'
            f.create_array(g, key, item)
    return g


def group2dict(f, g, recursive=True, warn=True,
               warn_if_bigger_than_nbytes=100E6):
    """
    Traverse a group, pull the contents of its children and return them
    as a Python dictionary, with the node names as the dictionary keys.
    If 'recursive == True' (default), we will recursively traverse
    child groups and put their children into sub-dictionaries,
    otherwise sub-groups will be skipped. Since this might potentially
    result in huge arrays being loaded into system memory, the 'warn'
    option will prompt the user to confirm before loading any
    individual array that is bigger than some threshold (default is
    100MB).
    """
    def memtest(child, threshold=warn_if_bigger_than_nbytes):
        mem = child.size_in_memory
        if mem > threshold:
            print '[!] "%s" is %iMB in size [!]' % (child._v_pathname,
                                                    mem / 1E6)
            confirm = raw_input('Load it anyway? [y/N] >>')
            if confirm.lower() == 'y':
                return True
            else:
                print 'Skipping item "%s"...' % child._v_pathname
                return False
        else:
            return True

    outdict = {}
    for child in g:
        try:
            if isinstance(child, tables.group.Group):
                if recursive:
                    item = group2dict(f, child)
                else:
                    continue
            else:
                if memtest(child):
                    item = child.read()
                    if isinstance(item, str) and item == '_None':
                        item = None
                else:
                    continue
            outdict.update({child._v_name: item})
        except tables.NoSuchNodeError:
            warnings.warn('No such node: "%s", skipping...' % repr(child))
    return outdict
```
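The pickling fallback described above, for objects PyTables cannot store natively, boils down to something like this. A minimal sketch using Python 3's `pickle` rather than the `cPickle` module imported above:

```python
import pickle

# An object with no native PyTables Array representation:
obj = {'weird': (1 + 2j, slice(0, 10))}

# Serialize it to a byte string, which *can* be stored as a string array...
blob = pickle.dumps(obj)

# ...and unpickle it again when reading the group contents back out.
restored = pickle.loads(blob)
assert restored == obj
```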
It's also worth mentioning `joblib.dump` and `joblib.load`, which tick all of your boxes apart from Python 2/3 cross-compatibility. Under the hood they use `np.save` for numpy arrays and `cPickle` for everything else.
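For illustration, a round trip with joblib might look like this (the filename and the contents of `data` are made up):

```python
import os
import tempfile

import joblib
import numpy as np

data = {'weights': np.arange(5), 'meta': {'lr': 0.1, 'note': None}}

# Dump the dict to disk; numpy arrays inside it are stored efficiently.
path = os.path.join(tempfile.mkdtemp(), 'data.joblib')
joblib.dump(data, path)

# Load it back; arrays and nested dicts come back intact.
loaded = joblib.load(path)
assert np.array_equal(loaded['weights'], data['weights'])
assert loaded['meta'] == data['meta']
```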
I tried playing with `np.memmap` for saving an array of dictionaries. Say we have the dictionary:

```python
d = {'a': 1, 'b': 2, 'c': [1, 2, 3, {'d': 4}]}
```

First I tried to save it directly to a `memmap`:

```python
f = np.memmap('stack.array', dtype=dict, mode='w+', shape=(100,))
f[0] = d
# CRASHES when reopening, since it loses the memory pointer

f = np.memmap('stack.array', dtype=object, mode='w+', shape=(100,))
f[0] = d
# CRASHES when reopening for the same reason
```

What did work was converting the dictionary to a string:

```python
f = np.memmap('stack.array', dtype='|S1000', mode='w+', shape=(100,))
f[0] = str(d)
```

This works, and afterwards you can call `eval(f[0])` to get the value back.
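When the stored string only ever contains plain literals like this one, `ast.literal_eval` is a safer way to turn it back into a dict than `eval`, since it refuses to execute arbitrary expressions. A sketch of the full round trip (the temporary file path is illustrative):

```python
import ast
import os
import tempfile

import numpy as np

d = {'a': 1, 'b': 2, 'c': [1, 2, 3, {'d': 4}]}

# Write the dict's string form into a fixed-width string memmap.
path = os.path.join(tempfile.mkdtemp(), 'stack.array')
f = np.memmap(path, dtype='|S1000', mode='w+', shape=(100,))
f[0] = str(d)
f.flush()

# Reopen the file and parse the stored string back into a dict.
g = np.memmap(path, dtype='|S1000', mode='r', shape=(100,))
restored = ast.literal_eval(g[0].decode())
assert restored == d
```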
I do not know what advantages this approach has over the others, but it deserves a closer look.