
sampling multinomial from small log probability vectors in numpy/scipy


First of all, I believe the problem you're encountering is that you're normalizing your probabilities incorrectly. This line is incorrect:

a = np.exp(l) / scipy.misc.logsumexp(l)

You're dividing a probability by a log probability, which makes no sense. Instead you probably want

a = np.exp(l - scipy.misc.logsumexp(l))

If you do that, you find a = [1, 0] and your multinomial sampler works as expected up to floating point precision in the second probability.
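For example, here's the fix end to end with an illustrative log-probability vector (not the one from your question); note that in recent SciPy versions logsumexp lives in scipy.special rather than scipy.misc:

import numpy as np
from scipy.special import logsumexp  # scipy.misc.logsumexp in older SciPy

l = np.array([0.0, -40.0])    # illustrative log-probabilities
a = np.exp(l - logsumexp(l))  # normalize in log space, then exponentiate
# a ≈ [1.0, 4.25e-18]
np.random.multinomial(10, a)
# e.g. array([10,  0])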


A Solution for Small N: Histograms

That said, if you still need more precision and performance is not as much of a concern, one way you could make progress is by implementing a multinomial sampler from scratch, and then modifying this to work at higher precision.

NumPy's multinomial function is implemented in Cython, and essentially performs a loop over a number of binomial samples and combines them into a multinomial sample. You can call it like this:

np.random.multinomial(10, [0.1, 0.2, 0.7])
# [0, 1, 9]

(Note that the precise output values here & below are random, and will change from call to call).

Another way you might implement a multinomial sampler is to generate N uniform random values, then compute the histogram with bins defined by the cumulative probabilities:

def multinomial(N, p):
    rand = np.random.uniform(size=N)
    p_cuml = np.cumsum(np.hstack([[0], p]))
    p_cuml /= p_cuml[-1]
    return np.histogram(rand, bins=p_cuml)[0]

multinomial(10, [0.1, 0.2, 0.7])
# [1, 1, 8]

With this method in mind, we can think about doing things to higher precision by keeping everything in log-space. The main trick is to realize that the log of uniform random deviates is equivalent to the negative of exponential random deviates, and so you can do everything above without ever leaving log space:

def multinomial_log(N, logp):
    log_rand = -np.random.exponential(size=N)
    logp_cuml = np.logaddexp.accumulate(np.hstack([[-np.inf], logp]))
    logp_cuml -= logp_cuml[-1]
    return np.histogram(log_rand, bins=logp_cuml)[0]

multinomial_log(10, np.log([0.1, 0.2, 0.7]))
# [1, 2, 7]
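As a quick sanity check of the key identity here (my own illustration, not part of the sampler): the log of a uniform deviate and the negative of an exponential deviate should match in distribution, with mean -1 and variance 1:

import numpy as np

log_u = np.log(np.random.uniform(size=100000))  # log(U), U ~ Uniform(0, 1)
neg_e = -np.random.exponential(size=100000)     # -E, E ~ Exponential(1)
log_u.mean(), neg_e.mean()
# both ≈ -1
log_u.var(), neg_e.var()
# both ≈ 1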

The resulting multinomial draws will maintain precision even for very small values in the p array.

Unfortunately, these histogram-based solutions will be much slower than the native numpy.multinomial function, so if performance is an issue you may need another approach. One option would be to adapt the Cython code linked above to work in log-space, using mathematical tricks similar to the ones I used here.


A Solution for Large N: Poisson Approximation

The problem with the above solution is that as N grows large, it becomes very slow. I was thinking about this and realized there's a more efficient way forward, despite np.random.multinomial failing for probabilities smaller than 1E-16 or so.

Here's an example of that failure: on a 64-bit machine, this will always give zero for the first entry because of the way the code is implemented, when in reality it should give something near 10:

np.random.multinomial(1E18, [1E-17, 1])
# array([                  0, 1000000000000000000])

If you dig into the source, you can trace this issue to the binomial function upon which the multinomial function is built. The Cython code internally does something like this:

def multinomial_basic(N, p, size=None):
    results = np.array([np.random.binomial(N, pi, size) for pi in p])
    results[-1] = int(N) - results[:-1].sum(0)
    return np.rollaxis(results, 0, results.ndim)

multinomial_basic(1E18, [1E-17, 1])
# array([                  0, 1000000000000000000])

The problem is that the binomial function chokes on very small values of p – this is because the algorithm computes the quantity (1 - p), and for p much smaller than about 1E-16, (1 - p) rounds to exactly 1.0 in double precision, so the value of p is effectively limited by floating-point precision.
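You can see this floating-point limitation directly, independent of the NumPy source:

p = 1E-17
1.0 - p == 1.0
# True: p is smaller than machine epsilon, so it's lost entirely when (1 - p) is computed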

So what can we do? Well, it turns out that for small values of p, the Poisson distribution is an extremely good approximation of the binomial distribution, and the implementation doesn't have these issues. So we can build a robust multinomial function based on a robust binomial sampler that switches to a Poisson sampler at small p:

def binomial_robust(N, p, size=None):
    if p < 1E-7:
        return np.random.poisson(N * p, size)
    else:
        return np.random.binomial(N, p, size)

def multinomial_robust(N, p, size=None):
    results = np.array([binomial_robust(N, pi, size) for pi in p])
    results[-1] = int(N) - results[:-1].sum(0)
    return np.rollaxis(results, 0, results.ndim)

multinomial_robust(1E18, [1E-17, 1])
# array([                 12, 999999999999999988])

The first entry is nonzero and near 10, as expected! Note that we can't use N larger than 1E18, because it would overflow the long integer. But we can confirm that our approach works for smaller probabilities using the size parameter, and averaging over results:

p = [1E-23, 1E-22, 1E-21, 1E-20, 1]
size = int(1E6)
multinomial_robust(1E18, p, size).mean(0)
# array([  1.70000000e-05,   9.00000000e-05,   9.76000000e-04,
#          1.00620000e-02,   1.00000000e+18])

We see that even for these very small probabilities, the multinomial values are turning up in the right proportion. The result is a very robust and very fast approximation to the multinomial distribution for small p.
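If you want to convince yourself that the Poisson branch is a safe stand-in near the 1E-7 switchover, here's a quick empirical check of my own (the particular N and size are arbitrary):

import numpy as np

N, p, size = 10**7, 1E-7, 100000
binom = np.random.binomial(N, p, size)
poisson = np.random.poisson(N * p, size)
binom.mean(), poisson.mean()
# both ≈ N * p = 1.0
binom.var(), poisson.var()
# both ≈ 1.0, since the binomial variance N*p*(1-p) ≈ N*p for small p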