
Fastest way to compute entropy of each numpy array row?


scipy.special.entr computes -x*log(x) for each element in an array. After calling that, you can sum the rows.

Here's an example. First, create an array p of positive values whose rows sum to 1:

In [23]: np.random.seed(123)

In [24]: x = np.random.rand(3, 10)

In [25]: p = x/x.sum(axis=1, keepdims=True)

In [26]: p
Out[26]:
array([[ 0.12798052,  0.05257987,  0.04168536,  0.1013075 ,  0.13220688,
         0.07774843,  0.18022149,  0.1258417 ,  0.08837421,  0.07205402],
       [ 0.08313743,  0.17661773,  0.1062474 ,  0.01445742,  0.09642919,
         0.17878489,  0.04420998,  0.0425045 ,  0.12877228,  0.1288392 ],
       [ 0.11793032,  0.15790292,  0.13467074,  0.11358463,  0.13429674,
         0.06003561,  0.06725376,  0.0424324 ,  0.05459921,  0.11729367]])

In [27]: p.shape
Out[27]: (3, 10)

In [28]: p.sum(axis=1)
Out[28]: array([ 1.,  1.,  1.])

Now compute the entropy of each row. entr uses the natural logarithm, so to get the entropy in bits (base 2), divide the result by log(2).

In [29]: from scipy.special import entr

In [30]: entr(p).sum(axis=1)
Out[30]: array([ 2.22208731,  2.14586635,  2.22486581])

In [31]: entr(p).sum(axis=1)/np.log(2)
Out[31]: array([ 3.20579434,  3.09583074,  3.20980287])

If you don't want the dependency on scipy, you can use the explicit formula:

In [32]: (-p*np.log2(p)).sum(axis=1)
Out[32]: array([ 3.20579434,  3.09583074,  3.20980287])
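If you end up doing this in more than one place, the two steps above can be wrapped in a small helper. A minimal sketch (the name row_entropy is just for illustration, not part of scipy):

import numpy as np
from scipy.special import entr

def row_entropy(p, base=2):
    """Entropy of each row of a 2-D probability array.

    Assumes each row of p already sums to 1.
    """
    h = entr(p).sum(axis=1)      # entropy in nats
    if base is not None:
        h = h / np.log(base)     # convert to the requested base (2 -> bits)
    return h

With the p array from above, row_entropy(p) reproduces the values in Out[31].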


As @Warren pointed out, it's unclear from your question whether you are starting out from an array of probabilities, or from the raw samples themselves. In my answer I've assumed the latter, in which case the main bottleneck will be computing the bin counts over each row.

Assuming that each vector of samples is relatively long, the fastest way to do this will probably be to use np.bincount:

import numpy as np

def entropy(x):
    """
    x is assumed to be an (nsignals, nsamples) array containing integers between
    0 and n_unique_vals
    """
    x = np.atleast_2d(x)
    nrows, ncols = x.shape
    nbins = x.max() + 1

    # count the number of occurrences for each unique integer between 0 and x.max()
    # in each row of x (a list comprehension rather than a generator, since
    # np.vstack requires a sequence)
    counts = np.vstack([np.bincount(row, minlength=nbins) for row in x])

    # divide by number of columns to get the probability of each unique value
    p = counts / float(ncols)

    # compute Shannon entropy in bits (note: a zero count would give nan here,
    # since 0*log2(0) evaluates to nan in numpy)
    return -np.sum(p * np.log2(p), axis=1)

Although Warren's method of computing the entropies from the probability values using entr is slightly faster than the explicit formula, in practice that difference is likely to be a tiny fraction of the total runtime, which is dominated by computing the bin counts.
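If you want to check that on your own machine, here is a rough timing sketch of just the probabilities-to-entropy step (the setup below is illustrative; numbers will vary with your NumPy/SciPy builds and hardware):

import timeit
import numpy as np
from scipy.special import entr

rng = np.random.RandomState(0)
x = rng.rand(1000, 10000)
p = x / x.sum(axis=1, keepdims=True)

# time the entr-based formula against the explicit -p*log2(p) formula
t_entr = timeit.timeit(lambda: entr(p).sum(axis=1) / np.log(2), number=10)
t_explicit = timeit.timeit(lambda: (-p * np.log2(p)).sum(axis=1), number=10)
print("entr-based: %.3f s, explicit: %.3f s" % (t_entr, t_explicit))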

Test correctness for a single row:

vals = np.arange(3)
prob = np.array([0.1, 0.7, 0.2])
row = np.random.choice(vals, p=prob, size=1000000)
print("theoretical H(x): %.6f, empirical H(x): %.6f" %
      (-np.sum(prob * np.log2(prob)), entropy(row)[0]))
# theoretical H(x): 1.156780, empirical H(x): 1.157532

Test speed:

In [1]: %%timeit x = np.random.choice(vals, p=prob, size=(1000, 10000))
   ....: entropy(x)
   ....:
10 loops, best of 3: 34.6 ms per loop
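For reference (this baseline is not from the original answer), the same quantities can be computed row by row with scipy.stats.entropy in a Python loop, which you can time against entropy(x) to see the overhead of the per-row scipy calls. The name entropy_loop is just for illustration:

import numpy as np
from scipy.stats import entropy as scipy_entropy  # renamed to avoid clashing with entropy() above

def entropy_loop(x):
    """Per-row baseline: bincount each row, then scipy.stats.entropy in a loop."""
    x = np.atleast_2d(x)
    nbins = x.max() + 1
    # scipy.stats.entropy normalizes the counts to probabilities itself
    return np.array([scipy_entropy(np.bincount(row, minlength=nbins), base=2)
                     for row in x])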

If your data don't consist of integer indices between 0 and the number of unique values, you can convert them into this format using np.unique:

y = np.random.choice([2.5, 3.14, 42], p=prob, size=(1000, 10000))
unq, x = np.unique(y, return_inverse=True)
x.shape = y.shape
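x then holds integer codes 0 through len(unq) - 1 with the same shape as y, so it can be passed straight to the entropy function above, and unq records which original value each code stands for:

h = entropy(x)   # entropy in bits for each row of y
print(unq)       # the original values for codes 0, 1, 2 (2.5, 3.14, 42.0)
print(h[:3])     # entropy of the first three rows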