Key: value store in Python for possibly 100 GB of data, without client/server [closed]


You can use sqlitedict, which provides a key-value interface to an SQLite database.

The SQLite limits page says that the theoretical maximum is 140 TB, depending on page_size and max_page_count. However, the default values for Python 3.5.2-2ubuntu0~16.04.4 (sqlite3 2.6.0) are page_size=1024 and max_page_count=1073741823. That gives a maximum database size of ~1100 GB, which fits your requirement.
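If you want to verify those defaults on your own build, you can query them directly. This is just a quick sanity check using only the standard sqlite3 module (the :memory: database is simply a convenient place to read the pragmas):

import sqlite3

conn = sqlite3.connect(':memory:')
page_size = conn.execute('PRAGMA page_size').fetchone()[0]
max_page_count = conn.execute('PRAGMA max_page_count').fetchone()[0]
conn.close()

# maximum database size = page_size * max_page_count
print(page_size * max_page_count / 10**9, 'GB')  # ~1100 GB with the defaults above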

You can use the package like this:

from sqlitedict import SqliteDict

mydict = SqliteDict('./my_db.sqlite', autocommit=True)
mydict['some_key'] = any_picklable_object
print(mydict['some_key'])
for key, value in mydict.items():
    print(key, value)
print(len(mydict))
mydict.close()

Update

About memory usage: SQLite doesn't need your dataset to fit in RAM. By default it caches up to cache_size pages, which is barely 2 MiB (with the same Python build as above). Here's a script you can use to check memory usage with your own data. Before running it, install the dependencies:

pip install lipsum psutil matplotlib psrecord sqlitedict

sqlitedct.py

#!/usr/bin/env python3
import os
import random
from contextlib import closing

import lipsum
from sqlitedict import SqliteDict


def main():
    with closing(SqliteDict('./my_db.sqlite', autocommit=True)) as d:
        for _ in range(100000):
            v = lipsum.generate_paragraphs(2)[0:random.randint(200, 1000)]
            d[os.urandom(10)] = v


if __name__ == '__main__':
    main()

Run it like ./sqlitedct.py & psrecord --plot=plot.png --interval=0.1 $!. In my case it produces this chart:

[chart: memory and CPU usage over time, as recorded by psrecord]

And database file:

$ du -h my_db.sqlite
84M     my_db.sqlite


I would consider HDF5 for this. It has several advantages:

  • Usable from many programming languages.
  • Usable from Python via the excellent h5py package.
  • Battle tested, including with large data sets.
  • Supports variable-length string values.
  • Values are addressable by a filesystem-like "path" (/foo/bar).
  • Values can be arrays (and usually are), but do not have to be.
  • Optional built-in compression.
  • Optional "chunking" to allow writing chunks incrementally.
  • Does not require loading the entire data set into memory at once.

It does have some disadvantages too:

  • Extremely flexible, to the point of making it hard to define a single approach.
  • Complex format, not feasible to use without the official HDF5 C library (but there are many wrappers, e.g. h5py).
  • Baroque C/C++ API (the Python one is much less so).
  • Little support for concurrent writers (or writer + readers). Writes might need to lock at a coarse granularity.

You can think of HDF5 as a way to store values (scalars or N-dimensional arrays) within a hierarchy inside a single file (or, indeed, multiple such files). The biggest problem with storing your values as individual files on disk would be that you'd overwhelm some filesystems; you can think of HDF5 as a filesystem within a file that won't fall down when you put a million values in one "directory."
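To give a feel for it, here is a minimal h5py sketch (the file name and the /foo/bar path are just placeholders). It stores a variable-length string under a path and reads it back:

import h5py

# Write: store a variable-length string value addressed by a filesystem-like path.
# ('w' truncates the file, so the sketch can be re-run.)
with h5py.File('store.h5', 'w') as f:
    dt = h5py.string_dtype()             # variable-length UTF-8 strings
    grp = f.require_group('foo')         # a "directory" inside the file
    grp.create_dataset('bar', data='some fairly large value', dtype=dt)

# Read it back by path (h5py 3.x returns the value as bytes).
with h5py.File('store.h5', 'r') as f:
    print(f['/foo/bar'][()])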


I know it's an old question, but I wrote something like this long ago:

https://github.com/dagnelies/pysos

It works like a normal Python dict, but has the advantage of being much more efficient than shelve on Windows, and it's also cross-platform, unlike shelve, whose data storage format differs depending on the OS.

To install:

pip install pysos

Usage:

import pysos

db = pysos.Dict('somefile')
db['hello'] = 'persistence!'

EDIT: Performance

Just to give a ballpark figure, here is a mini benchmark (on my Windows laptop):

import time

import pysos

N = 100 * 1000

db = pysos.Dict("test.db")

t = time.time()
for i in range(N):
    db["key_" + str(i)] = {"some": "object_" + str(i)}
db.close()

print('PYSOS time:', time.time() - t)
# => PYSOS time: 3.424309253692627

The resulting file was about 3.5 MB. So, very roughly speaking, you could insert about 1 MB of data per second.

EDIT: How it works

It writes to disk every time you set a value, but only the key/value pair, so the cost of adding/updating/deleting an item is always the same. Adding only is still "better", because lots of updating/deleting leads to data fragmentation in the file (wasted junk bytes). What is kept in memory is the mapping (key -> location in the file), so you just have to make sure there is enough RAM for all those keys. An SSD is also highly recommended. 100 MB is easy and fast; 100 GB, as posted originally, will be a lot, but doable. Even just reading/writing 100 GB of raw data takes quite some time.
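For intuition, that layout is roughly an append-only log plus an in-memory offset index. This is not pysos's actual code, just a small sketch of the idea (the class name and the JSON-lines record format are made up here):

import json

class TinyLogDict:
    """Append-only key/value file with an in-memory (key -> offset) index.
    Illustrative only; not how pysos is actually implemented."""

    def __init__(self, path):
        self.index = {}                      # key -> byte offset of the latest record
        self.f = open(path, 'a+b')
        self.f.seek(0)
        offset = 0
        for line in self.f:                  # rebuild the index on startup
            key, _ = json.loads(line.decode())
            self.index[key] = offset
            offset += len(line)

    def __setitem__(self, key, value):
        self.f.seek(0, 2)                    # append-only: writes go to the end
        self.index[key] = self.f.tell()
        self.f.write(json.dumps([key, value]).encode() + b'\n')
        self.f.flush()

    def __getitem__(self, key):
        self.f.seek(self.index[key])         # only the offsets live in RAM
        return json.loads(self.f.readline().decode())[1]

    def close(self):
        self.f.close()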