Fast way to read interleaved data? [numpy]


The best way to really improve the performance is to get rid of the Python loop over all samples and let NumPy do this loop in compiled C code. This is a bit tricky to achieve, but it is possible.

First, you need a bit of preparation. As pointed out by Justin Peel, the pattern in which the samples are arranged repeats after some number of steps. If d_1, ..., d_k are the divisors for your k data streams and b_1, ..., b_k are the sample sizes of the streams in bytes, and lcm is the least common multiple of these divisors, then

N = lcm * (b_1/d_1 + ... + b_k/d_k)

will be the number of bytes which the pattern of streams will repeat after. If you have figured out which stream each of the first N bytes belongs to, you can simply repeat this pattern.
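As a concrete illustration (with made-up divisors and sample widths, not taken from the question), the repeat length N can be computed like this:

```python
from functools import reduce
from math import gcd

def lcm(a, b):
    # Least common multiple of two integers.
    return a * b // gcd(a, b)

# Hypothetical streams: stream i emits one sample every d[i] steps,
# each sample being b[i] bytes wide.
d = [2, 3, 4]
b = [2, 2, 4]

period = reduce(lcm, d)                             # lcm(2, 3, 4) = 12
# Stream i contributes period * b[i] // d[i] bytes per repeat.
N = sum(period * bi // di for bi, di in zip(b, d))  # bytes per repeat
```

Here stream 0 contributes 12 bytes, stream 1 contributes 8, and stream 2 contributes 12, so the pattern repeats every 32 bytes.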

You can now build the array of stream indices for the first N bytes by something similar to

stream_index = []
for sample_num in range(lcm):
    stream_index += [i for i, ch in enumerate(all_channels)
                     if ch.samples_for(sample_num)]
repeat_count = [b[i] for i in stream_index]
stream_index = numpy.array(stream_index).repeat(repeat_count)

Here, b is the sequence b_1, ..., b_k, and ch.samples_for(sample_num) reports whether channel ch emits a sample at step sample_num (for channel i, that is when sample_num is divisible by d_i).

Now you can do

data = numpy.fromfile(my_file, dtype=numpy.uint8).reshape(-1, N)
streams = [data[:, stream_index == i].ravel() for i in range(k)]

You may need to pad the data at the end to make the reshape() work, since the file length might not be a multiple of N.
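The padding could be done along these lines (a sketch; the zero bytes create phantom trailing samples that should be discarded after decoding):

```python
import numpy as np

N = 5                                   # repeat length in bytes (example value)
raw = np.arange(12, dtype=np.uint8)     # stand-in for np.fromfile(my_file, dtype=np.uint8)
pad = -len(raw) % N                     # bytes missing from the last repeat
padded = np.concatenate([raw, np.zeros(pad, dtype=np.uint8)])
data = padded.reshape(-1, N)            # now the reshape succeeds
```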

Now you have all the bytes belonging to each stream in separate NumPy arrays. You can reinterpret the data by simply assigning to the dtype attribute of each stream. If you want the first stream to be interpreted as big-endian integers, simply write

streams[0].dtype = ">i"

This won't change the data in the array in any way, just the way it is interpreted.

This may look a bit cryptic, but should be much better performance-wise.
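A self-contained toy run of the whole idea (two invented streams; the periods and widths are assumptions for illustration, and the sample_num % d[i] check stands in for ch.samples_for(sample_num) since the toy has no channel objects):

```python
import numpy as np

# Stream 0: a 2-byte sample every step (d=1); stream 1: a 1-byte
# sample every other step (d=2).
d = [1, 2]
b = [2, 1]
lcm = 2

# Per-byte stream index over one repeat of the pattern.
stream_index = []
for sample_num in range(lcm):
    stream_index += [i for i in range(len(d)) if sample_num % d[i] == 0]
repeat_count = [b[i] for i in stream_index]
stream_index = np.array(stream_index).repeat(repeat_count)
N = len(stream_index)                   # 5 bytes per repeat: [0, 0, 1, 0, 0]

raw = np.arange(2 * N, dtype=np.uint8)  # stand-in for the file's bytes
data = raw.reshape(-1, N)
streams = [data[:, stream_index == i].ravel() for i in range(len(d))]
```

With these inputs, stream 0 collects the bytes at columns 0, 1, 3 and 4 of every repeat, and stream 1 gets column 2, all without a Python-level loop over samples.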


Replace channel.samples_for(sample_num) with an iter_channels(channels_config) iterator that keeps some internal state and lets you read the file in one pass. Use it like this:

from itertools import izip

for chan, sample_data in izip(iter_channels(), data):
    decoded_data = chan.decode(sample_data)

To implement the iterator, think of a base clock with a period of one. The periods of the various channels are integers. Iterate the channels in order, and emit a channel if the clock modulo its period is zero.

def iter_channels(channels):
    for i in itertools.count():
        for chan in channels:
            if i % chan.period == 0:
                yield chan


The grouper() recipe along with itertools.izip() should be of some help here.
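For reference, the grouper() recipe from the itertools documentation chops an iterable into fixed-size chunks (shown here with zip_longest, the Python 3 name for izip_longest):

```python
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    # itertools recipe: collect data into fixed-length chunks.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# Iterating bytes yields integers, so each chunk is a tuple of byte values.
samples = list(grouper(b"abcdefgh", 4))
```

Zipping this against the channel iterator then yields (channel, sample_bytes) pairs in file order.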