Shelve is too slow for large dictionaries, what can I do to improve performance? Shelve is too slow for large dictionaries, what can I do to improve performance? database database

Shelve is too slow for large dictionaries, what can I do to improve performance?


For storing a large dictionary of string : number key-value pairs, I'd suggest a JSON-native storage solution such as MongoDB. It has a wonderful API for Python, Pymongo. MongoDB itself is lightweight and incredibly fast, and json objects will natively be dictionaries in Python. This means that you can use your string key as the object ID, allowing for compressed storage and quick lookup.

As an example of how easy the code would be, see the following:

d = {'string1' : 1, 'string2' : 2, 'string3' : 3}from pymongo import Connectionconn = Connection()db = conn['example-database']collection = db['example-collection']for string, num in d.items():    collection.save({'_id' : string, 'value' : num})# testingnewD = {}for obj in collection.find():    newD[obj['_id']] = obj['value']print newD# output is: {u'string2': 2, u'string3': 3, u'string1': 1}

You'd just have to convert back from unicode, which is trivial.


Based on my experience, I would recommend using SQLite3, which comes with Python. It works well with larger databases and key numbers. Millions of keys and gigabytes of data is not a problem. Shelve is totally wasted at that point. Also having separate db-process isn't beneficial, it just requires more context swaps. In my tests I found out that SQLite3 was the preferred option to use, when handling larger data sets locally. Running local database engine like mongo, mysql or postgresql doesn't provide any additional value and also were slower.


I think your problem is due to the fact that you use the writeback=True. The documentation says (emphasis is mine):

Because of Python semantics, a shelf cannot know when a mutable persistent-dictionary entry is modified. By default modified objects are written only when assigned to the shelf (see Example). If the optional writeback parameter is set to True, all entries accessed are also cached in memory, and written back on sync() and close(); this can make it handier to mutate mutable entries in the persistent dictionary, but, if many entries are accessed, it can consume vast amounts of memory for the cache, and it can make the close operation very slow since all accessed entries are written back (there is no way to determine which accessed entries are mutable, nor which ones were actually mutated).

You could avoid using writeback=True and make sure the data is written only once (you have to pay attention that subsequent modifications are going to be lost).

If you believe this is not the right storage option (it's difficult to say without knowing how the data is structured), I suggest sqlite3, it's integrated in python (thus very portable) and has very nice performances. It's somewhat more complicated than a simple key-value store.

See other answers for alternatives.