Why is a BSON serialized numpy array much bigger than the original?


The reason for this increased number of bytes is how BSON saves the data. You can find this information in the BSON specification, but let's look at a concrete example:

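    import numpy as np
    import bson  # assumes the bson module bundled with PyMongo

    # Five unsigned 8-bit values: [0, 11, 22, 33, 44]
    npdata = np.arange(5, dtype='B') * 11
    listdata = npdata.tolist()

    # Encode only the array; every list item becomes its own BSON element
    bsondata = bson.BSON.encode({"data": listdata})
    print([hex(b) for b in bsondata])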

Here, we store an array with the values [0, 11, 22, 33, 44] as BSON and print the resulting binary data. Below, I have annotated the result to explain what's going on:

    ['0x33', '0x0', '0x0', '0x0',  # total number of bytes in the document (51)
     # first element in document
     '0x4',  # type: array
     '0x64', '0x61', '0x74', '0x61', '0x0',  # key: "data"
     # subdocument (data array)
         '0x28', '0x0', '0x0', '0x0',  # total number of bytes (40)
         # first element in data array
             '0x10',                        # 32-bit integer
             '0x30', '0x0',                 # key: "0"
             '0x0', '0x0', '0x0', '0x0',    # value: 0
         # second element in data array
             '0x10',                        # 32-bit integer
             '0x31', '0x0',                 # key: "1"
             '0xb', '0x0', '0x0', '0x0',    # value: 11
         # third element in data array
             '0x10',                        # 32-bit integer
             '0x32', '0x0',                 # key: "2"
             '0x16', '0x0', '0x0', '0x0',   # value: 22
         # ...
    ]
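To make the layout easier to follow, here is a minimal sketch that builds the same kind of document by hand with the standard `struct` module. The function name `encode_int32_array_document` is made up for this illustration and is not part of any BSON library; it only handles a single key holding a list of 32-bit integers.

```python
import struct

def encode_int32_array_document(key, values):
    """Hand-encode {key: [int32, ...]} following the layout annotated above."""
    elements = b""
    for index, value in enumerate(values):
        elements += b"\x10"                               # element type: 32-bit integer
        elements += str(index).encode("ascii") + b"\x00"  # key: the index as a C string
        elements += struct.pack("<i", value)              # little-endian int32 value

    # A (sub)document is: int32 total length, its elements, a trailing 0x00 byte.
    array_doc = struct.pack("<i", 4 + len(elements) + 1) + elements + b"\x00"

    # Outer document with one element: type 0x04 (array), the key, the subdocument.
    body = b"\x04" + key.encode("utf-8") + b"\x00" + array_doc
    return struct.pack("<i", 4 + len(body) + 1) + body + b"\x00"

encoded = encode_int32_array_document("data", [0, 11, 22, 33, 44])
print(len(encoded))
print([hex(b) for b in encoded])
```

Note that nothing in this layout depends on how small the values are: every item carries its own type byte and its own key, no matter what dtype the numpy array had.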

In addition to some format overhead, each value of the array is rather wastefully encoded with 7 bytes: 1 byte to specify the data type, 2 bytes for a null-terminated string containing the index (3 bytes for indices >= 10, 4 bytes for indices >= 100, ...), and 4 bytes for the 32-bit integer value. The original array, with dtype='B', needs only 1 byte per value.
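The per-element cost described above can be turned into a small size estimate. This is only a sketch under the assumption that the document has a single key and every value is stored as a 32-bit integer; the helper name `bson_int32_array_size` is hypothetical.

```python
def bson_int32_array_size(n, key="data"):
    """Size in bytes of a BSON document {key: [n int32 values]}."""
    # Per element: 1 type byte + index digits + 1 terminator + 4 value bytes.
    elements = sum(1 + len(str(i)) + 1 + 4 for i in range(n))
    # Subdocument: 4 length bytes + elements + 1 terminator.
    array_doc = 4 + elements + 1
    # Outer document: 4 length bytes + type byte + key + terminator + array + terminator.
    return 4 + 1 + len(key.encode("utf-8")) + 1 + array_doc + 1

for n in (5, 100, 10_000):
    size = bson_int32_array_size(n)
    print(f"{n} uint8 values: {n} raw bytes vs {size} BSON bytes ({size / n:.1f}x)")
```

Because the index strings get longer as the array grows, the ratio to the raw uint8 data creeps upward for larger arrays rather than staying fixed at 7.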

This at least explains why the BSON data is so much bigger than the original array.

I found two libraries on GitHub, mongodb/bson-numpy and ajdavis/bson-numpy, which may do a better job of encoding numpy arrays in BSON. However, I did not try them, so I can't say whether that is the case or whether they even work correctly.