
Compress numpy arrays efficiently


What I do now:

    import gzip
    import numpy

    f = gzip.GzipFile("my_array.npy.gz", "w")
    numpy.save(file=f, arr=my_array)
    f.close()
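
For reference, reading it back is just the mirror image (a minimal sketch, assuming the file written above):

    # numpy.load accepts a file-like object, so the gzip wrapper works for reading too
    f = gzip.GzipFile("my_array.npy.gz", "r")
    my_array = numpy.load(f)
    f.close()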


  1. Noise is incompressible. Thus, any part of your data that is noise will go into the compressed output 1:1 regardless of the compression algorithm, unless you discard it somehow (lossy compression). If you have 24 bits per sample with an effective number of bits (ENOB) equal to 16 bits, the remaining 24 - 16 = 8 bits of noise will limit your maximum lossless compression ratio to 3:1, even if your (noiseless) data is perfectly compressible. Non-uniform noise is compressible to the extent to which it is non-uniform; you probably want to look at the effective entropy of the noise to determine how compressible it is.

  2. Compressing data is based on modelling it (partly to remove redundancy, but also partly so you can separate it from noise and discard the noise). For example, if you know your data is bandwidth-limited to 10 MHz and you're sampling at 200 MHz, you can do an FFT, zero out the high frequencies, and store the coefficients for the low frequencies only (in this example: 10:1 compression); see the FFT sketch after this list. There is a whole field called "compressive sensing" which is related to this.

  3. A practical suggestion, suitable for many kinds of reasonably continuous data: denoise -> bandwidth limit -> delta compress -> gzip (or xz, etc.); see the pipeline sketch after this list. The denoising could be the same as the bandwidth limiting, or a nonlinear filter like a running median. The bandwidth limit can be implemented with an FIR/IIR filter. Delta compression is just y[n] = x[n] - x[n-1].
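
To illustrate point 2, here is a minimal sketch of keeping only the low-frequency FFT coefficients of a signal band-limited to 10 MHz and sampled at 200 MHz. The tone frequency, array length and variable names are my own, chosen only so the example can check itself:

    import numpy as np

    fs = 200e6                                    # sample rate: 200 MHz
    f_cut = 10e6                                  # signal is band-limited to 10 MHz
    n = 1 << 20
    t = np.arange(n) / fs
    x = np.sin(2 * np.pi * 3.125e6 * t)           # a tone inside the band (lands on an exact FFT bin)

    coeffs = np.fft.rfft(x)                       # one-sided spectrum of the real signal
    n_keep = int(len(coeffs) * f_cut / (fs / 2))  # lowest 10% of the bins carry all the signal
    low = coeffs[:n_keep]                         # store only these coefficients

    # reconstruction: zero-pad the discarded high bins and inverse-transform
    padded = np.zeros_like(coeffs)
    padded[:n_keep] = low
    x_rec = np.fft.irfft(padded, n)
    print(abs(x - x_rec).max())                   # tiny reconstruction error

Note that the retained coefficients are complex, so the actual storage saving depends on how you encode them; the point is simply that the out-of-band bins carry no information and can be dropped.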
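
And a minimal end-to-end sketch of the pipeline in point 3; the median kernel size, the FIR length and the function names are arbitrary choices of my own, not prescriptions:

    import gzip
    import numpy as np
    from scipy.signal import medfilt, lfilter

    def compress_pipeline(x, filename):
        # denoise: running median (nonlinear, good against spikes)
        x = medfilt(np.asarray(x, dtype=float), kernel_size=5)
        # bandwidth limit: crude 8-tap moving-average FIR (use a proper filter design in practice)
        x = lfilter(np.ones(8) / 8.0, [1.0], x)
        # quantize, then delta compress: y[n] = x[n] - x[n-1]
        q = np.round(x).astype(np.int32)
        d = np.empty_like(q)
        d[0] = q[0]
        d[1:] = np.diff(q)
        # the small, repetitive deltas compress well with a generic compressor
        with gzip.open(filename, "wb") as f:
            np.save(f, d)

    def decompress_pipeline(filename):
        with gzip.open(filename, "rb") as f:
            d = np.load(f)
        return np.cumsum(d)   # undo the delta step (the filtering itself is lossy)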

EDIT An illustration:

    from pylab import *
    import numpy
    import numpy.random
    import os.path
    import subprocess

    # create 1M data points of a 24-bit sine wave with 8 bits of gaussian noise (ENOB=16)
    N = 1000000
    data = (sin(2 * pi * linspace(0, N, N) / 100) * (1 << 23) +
            numpy.random.randn(N) * (1 << 7)).astype(int32)

    numpy.save('data.npy', data)
    print(os.path.getsize('data.npy'))
    # 4000080 uncompressed size

    subprocess.call('xz -9 data.npy', shell=True)
    print(os.path.getsize('data.npy.xz'))
    # 1484192 compressed size
    # 11.87 bits per sample, ~8 bits of that is noise

    data_quantized = data // (1 << 8)   # integer division keeps int32
    numpy.save('data_quantized.npy', data_quantized)
    subprocess.call('xz -9 data_quantized.npy', shell=True)
    print(os.path.getsize('data_quantized.npy.xz'))
    # 318380
    # still have 16 bits of signal, but only takes 2.55 bits per sample to store it


Saving to HDF5 with compression can be very quick and efficient: it all depends on the compression algorithm, and on whether you want it to be quick while saving, while reading it back, or both. And, naturally, on the data itself, as explained above. GZIP tends to be somewhere in between in speed, but with a relatively low compression ratio. BZIP2 is slow on both ends, although with a better ratio. BLOSC is one of the algorithms that I have found to give quite good compression while being quick on both ends. The downside of BLOSC is that it is not implemented in all implementations of HDF5, so your program may not be portable. You always need to run at least some tests to select the best configuration for your needs.
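
As a minimal sketch of the HDF5 route using h5py (the dataset name and compression level are arbitrary; BLOSC would typically be reached through PyTables or an external filter plugin instead, which is the portability caveat above):

    import numpy as np
    import h5py

    arr = np.random.randn(1000, 1000)

    # write with the always-available gzip filter; compression_opts is the gzip level (0-9)
    with h5py.File("my_array.h5", "w") as f:
        f.create_dataset("my_array", data=arr, compression="gzip", compression_opts=9)

    # read it back
    with h5py.File("my_array.h5", "r") as f:
        restored = f["my_array"][...]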