
Compress numpy arrays efficiently


What I do now:

    import gzip
    import numpy

    f = gzip.GzipFile("my_array.npy.gz", "w")
    numpy.save(file=f, arr=my_array)
    f.close()
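
For reference, reading it back is just the mirror image (a minimal sketch, assuming the file written above):

    # numpy.load accepts a file-like object, so the gzip wrapper works for reading too
    f = gzip.GzipFile("my_array.npy.gz", "r")
    my_array = numpy.load(f)
    f.close()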


  1. Noise is incompressible. Thus, any part of your data that is noise will go into the compressed output 1:1 regardless of the compression algorithm, unless you discard it somehow (lossy compression). If you have 24 bits per sample with an effective number of bits (ENOB) equal to 16 bits, the remaining 24 - 16 = 8 bits of noise will limit your maximum lossless compression ratio to 3:1, even if your (noiseless) data is perfectly compressible. Non-uniform noise is compressible to the extent to which it is non-uniform; you probably want to look at the effective entropy of the noise to determine how compressible it is.

  2. Compressing data is based on modelling it (partly to remove redundancy, but also partly so you can separate it from noise and discard the noise). For example, if you know your data is bandwidth-limited to 10 MHz and you're sampling at 200 MHz, you can do an FFT, zero out the high frequencies, and store the coefficients for the low frequencies only (in this example: 10:1 compression); see the FFT sketch after this list. There is a whole field called "compressive sensing" which is related to this.

  3. A practical suggestion, suitable for many kinds of reasonably continuous data: denoise -> bandwidth limit -> delta compress -> gzip (or xz, etc.); see the pipeline sketch after this list. The denoising could be the same as the bandwidth limiting, or a nonlinear filter like a running median. The bandwidth limit can be implemented with an FIR/IIR filter. Delta compression is just y[n] = x[n] - x[n-1].
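
To illustrate point 2, here is a minimal sketch of keeping only the low-frequency FFT coefficients of a signal band-limited to 10 MHz and sampled at 200 MHz. The tone frequency, array length and variable names are my own, chosen only so the example can check itself:

    import numpy as np

    fs = 200e6                                    # sample rate: 200 MHz
    f_cut = 10e6                                  # signal is band-limited to 10 MHz
    n = 1 << 20
    t = np.arange(n) / fs
    x = np.sin(2 * np.pi * 3.125e6 * t)           # a tone inside the band (lands on an exact FFT bin)

    coeffs = np.fft.rfft(x)                       # one-sided spectrum of the real signal
    n_keep = int(len(coeffs) * f_cut / (fs / 2))  # lowest 10% of the bins carry all the signal
    low = coeffs[:n_keep]                         # store only these coefficients

    # reconstruction: zero-pad the discarded high bins and inverse-transform
    padded = np.zeros_like(coeffs)
    padded[:n_keep] = low
    x_rec = np.fft.irfft(padded, n)
    print(abs(x - x_rec).max())                   # tiny reconstruction error

Note that the retained coefficients are complex, so the actual storage saving depends on how you encode them; the point is simply that the out-of-band bins carry no information and can be dropped.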
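
And a minimal end-to-end sketch of the pipeline in point 3; the median kernel size, the FIR length and the function names are arbitrary choices of my own, not prescriptions:

    import gzip
    import numpy as np
    from scipy.signal import medfilt, lfilter

    def compress_pipeline(x, filename):
        # denoise: running median (nonlinear, good against spikes)
        x = medfilt(np.asarray(x, dtype=float), kernel_size=5)
        # bandwidth limit: crude 8-tap moving-average FIR (use a proper filter design in practice)
        x = lfilter(np.ones(8) / 8.0, [1.0], x)
        # quantize, then delta compress: y[n] = x[n] - x[n-1]
        q = np.round(x).astype(np.int32)
        d = np.empty_like(q)
        d[0] = q[0]
        d[1:] = np.diff(q)
        # the small, repetitive deltas compress well with a generic compressor
        with gzip.open(filename, "wb") as f:
            np.save(f, d)

    def decompress_pipeline(filename):
        with gzip.open(filename, "rb") as f:
            d = np.load(f)
        return np.cumsum(d)   # undo the delta step (the filtering itself is lossy)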

EDIT An illustration:

    from pylab import *
    import numpy
    import numpy.random
    import os.path
    import subprocess

    # create 1M data points of a 24-bit sine wave with 8 bits of gaussian noise (ENOB=16)
    N = 1000000
    data = (sin(2 * pi * linspace(0, N, N) / 100) * (1 << 23) +
            numpy.random.randn(N) * (1 << 7)).astype(int32)

    numpy.save('data.npy', data)
    print(os.path.getsize('data.npy'))
    # 4000080 uncompressed size

    subprocess.call('xz -9 data.npy', shell=True)
    print(os.path.getsize('data.npy.xz'))
    # 1484192 compressed size
    # 11.87 bits per sample, ~8 bits of that is noise

    data_quantized = data // (1 << 8)   # integer division keeps int32
    numpy.save('data_quantized.npy', data_quantized)
    subprocess.call('xz -9 data_quantized.npy', shell=True)
    print(os.path.getsize('data_quantized.npy.xz'))
    # 318380
    # still have 16 bits of signal, but only takes 2.55 bits per sample to store it


Saving to HDF5 with compression can be very quick and efficient: it all depends on the compression algorithm, and on whether you want it to be quick while saving, while reading it back, or both. And, naturally, on the data itself, as explained above. GZIP tends to be somewhere in between in speed, but with a relatively low compression ratio. BZIP2 is slow on both ends, although with a better ratio. BLOSC is one of the algorithms that I have found to give quite good compression while being quick on both ends. The downside of BLOSC is that it is not implemented in all implementations of HDF5, so your program may not be portable. You always need to run at least some tests to select the best configuration for your needs.
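
As a minimal sketch of the HDF5 route using h5py (the dataset name and compression level are arbitrary; BLOSC would typically be reached through PyTables or an external filter plugin instead, which is the portability caveat above):

    import numpy as np
    import h5py

    arr = np.random.randn(1000, 1000)

    # write with the always-available gzip filter; compression_opts is the gzip level (0-9)
    with h5py.File("my_array.h5", "w") as f:
        f.create_dataset("my_array", data=arr, compression="gzip", compression_opts=9)

    # read it back
    with h5py.File("my_array.h5", "r") as f:
        restored = f["my_array"][...]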