Fast way to read interleaved data? [numpy]


The best way to really improve the performance is to get rid of the Python loop over all samples and let NumPy do this loop in compiled C code. This is a bit tricky to achieve, but it is possible.

First, you need a bit of preparation. As pointed out by Justin Peel, the pattern in which the samples are arranged repeats after some number of steps. If d_1, ..., d_k are the divisors for your k data streams and b_1, ..., b_k are the sample sizes of the streams in bytes, and lcm is the least common multiple of these divisors, then

N = lcm * (b_1/d_1 + ... + b_k/d_k)

will be the number of bytes which the pattern of streams will repeat after. If you have figured out which stream each of the first N bytes belongs to, you can simply repeat this pattern.
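As a concrete illustration (with made-up divisors and sample widths, not taken from the question), the repeat length N can be computed like this:

```python
from functools import reduce
from math import gcd

def lcm(a, b):
    # Least common multiple of two integers.
    return a * b // gcd(a, b)

# Hypothetical streams: stream i emits one sample every d[i] steps,
# each sample being b[i] bytes wide.
d = [2, 3, 4]
b = [2, 2, 4]

period = reduce(lcm, d)                             # lcm(2, 3, 4) = 12
# Stream i contributes period * b[i] // d[i] bytes per repeat.
N = sum(period * bi // di for bi, di in zip(b, d))  # bytes per repeat
```

Here stream 0 contributes 12 bytes, stream 1 contributes 8, and stream 2 contributes 12, so the pattern repeats every 32 bytes.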

You can now build the array of stream indices for the first N bytes by something similar to

stream_index = []
for sample_num in range(lcm):
    stream_index += [i for i, ch in enumerate(all_channels)
                     if ch.samples_for(sample_num)]
repeat_count = [b[i] for i in stream_index]
stream_index = numpy.array(stream_index).repeat(repeat_count)

Here, b is the sequence b_1, ..., b_k, and ch.samples_for(sample_num) reports whether channel ch emits a sample at step sample_num (for channel i, that is when sample_num is divisible by d_i).

Now you can do

data = numpy.fromfile(my_file, dtype=numpy.uint8).reshape(-1, N)
streams = [data[:, stream_index == i].ravel() for i in range(k)]

You may need to pad the data at the end to make the reshape() work, since the file length might not be a multiple of N.
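The padding could be done along these lines (a sketch; the zero bytes create phantom trailing samples that should be discarded after decoding):

```python
import numpy as np

N = 5                                   # repeat length in bytes (example value)
raw = np.arange(12, dtype=np.uint8)     # stand-in for np.fromfile(my_file, dtype=np.uint8)
pad = -len(raw) % N                     # bytes missing from the last repeat
padded = np.concatenate([raw, np.zeros(pad, dtype=np.uint8)])
data = padded.reshape(-1, N)            # now the reshape succeeds
```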

Now you have all the bytes belonging to each stream in separate NumPy arrays. You can reinterpret the data by simply assigning to the dtype attribute of each stream. If you want the first stream to be interpreted as big-endian integers, simply write

streams[0].dtype = ">i"

This won't change the data in the array in any way, just the way it is interpreted.

This may look a bit cryptic, but should be much better performance-wise.
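A self-contained toy run of the whole idea (two invented streams; the periods and widths are assumptions for illustration, and the sample_num % d[i] check stands in for ch.samples_for(sample_num) since the toy has no channel objects):

```python
import numpy as np

# Stream 0: a 2-byte sample every step (d=1); stream 1: a 1-byte
# sample every other step (d=2).
d = [1, 2]
b = [2, 1]
lcm = 2

# Per-byte stream index over one repeat of the pattern.
stream_index = []
for sample_num in range(lcm):
    stream_index += [i for i in range(len(d)) if sample_num % d[i] == 0]
repeat_count = [b[i] for i in stream_index]
stream_index = np.array(stream_index).repeat(repeat_count)
N = len(stream_index)                   # 5 bytes per repeat: [0, 0, 1, 0, 0]

raw = np.arange(2 * N, dtype=np.uint8)  # stand-in for the file's bytes
data = raw.reshape(-1, N)
streams = [data[:, stream_index == i].ravel() for i in range(len(d))]
```

With these inputs, stream 0 collects the bytes at columns 0, 1, 3 and 4 of every repeat, and stream 1 gets column 2, all without a Python-level loop over samples.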


Replace channel.samples_for(sample_num) with an iter_channels(channels_config) iterator that keeps some internal state and lets you read the file in one pass. Use it like this:

from itertools import izip

for chan, sample_data in izip(iter_channels(), data):
    decoded_data = chan.decode(sample_data)

To implement the iterator, think of a base clock with a period of one. The periods of the various channels are integers. Iterate the channels in order, and emit a channel if the clock modulo its period is zero.

def iter_channels(channels):
    for i in itertools.count():
        for chan in channels:
            if i % chan.period == 0:
                yield chan


The grouper() recipe along with itertools.izip() should be of some help here.
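For reference, the grouper() recipe from the itertools documentation chops an iterable into fixed-size chunks (shown here with zip_longest, the Python 3 name for izip_longest):

```python
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    # itertools recipe: collect data into fixed-length chunks.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# Iterating bytes yields integers, so each chunk is a tuple of byte values.
samples = list(grouper(b"abcdefgh", 4))
```

Zipping this against the channel iterator then yields (channel, sample_bytes) pairs in file order.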