Plotting confidence intervals for Maximum Likelihood Estimate


Looks like you're ok on the first part, so I'll tackle your second and third points.

There are plenty of ways to fit smooth curves: splines via scipy.interpolate, or scipy.optimize.curve_fit. Personally, I prefer curve_fit, because you can supply your own function and let it fit the parameters for you.
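
For example, here is a rough curve_fit sketch. The bell-shaped model and the fake noisy data are placeholders for illustration only; you would swap in a function that matches the shape of your likelihood curve and pass your own xdata/ydata.

import numpy as np
from scipy.optimize import curve_fit

def model(x, a, mu, sigma):
    # bell-shaped curve with free height, center and width (illustrative choice only)
    return a * np.exp(-(x - mu)**2 / (2.0 * sigma**2))

xdata = np.arange(10, 500, dtype=float)
ydata = model(xdata, 0.3, 120.0, 60.0) + 0.02 * np.random.randn(len(xdata))  # fake noisy data

params, cov = curve_fit(model, xdata, ydata, p0=[0.3, 100.0, 50.0])  # p0 is an initial guess
ysmooth = model(xdata, *params)  # smooth curve evaluated at the original x values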

Alternatively, if you don't want to learn a parametric function, you could do simple rolling-window smoothing with numpy.convolve.
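
A minimal sketch of that, assuming ydata is the noisy likelihood array from your script (a random stand-in is used here so the snippet runs on its own):

import numpy as np

ydata = np.random.rand(490)                        # stand-in for your likelihood values
window = 15                                        # window width is a tuning knob
kernel = np.ones(window) / float(window)
ysmooth = np.convolve(ydata, kernel, mode='same')  # moving average, same length as the input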

As for code quality: you're not taking advantage of numpy's speed, because you're doing things in pure python. I would write your (existing) code like this:

from __future__ import division
import numpy as np
import matplotlib.pyplot as plt

# N is the true number of books.
# t is the number of weeks.
# unk is the true number of repeats found
t = 30
unk = 3

def numberrepeats(N, t, iters):
    rand = np.random.randint(0, N, size=(t, iters))
    # each column is one simulated run of t draws
    return t - np.array([len(set(r)) for r in rand.T])

iters = 1000
ydata = np.empty(500 - 10)
for N in xrange(10, 500):
    sampledunk = np.count_nonzero(numberrepeats(N, t, iters) == unk)
    ydata[N - 10] = sampledunk / iters

# ydata[0] corresponds to N = 10, so shift the argmax back to N
print "MLE is", np.argmax(ydata) + 10

xdata = range(10, 500)
print len(xdata), len(ydata)
plt.plot(xdata, ydata)
plt.show()

It's probably possible to optimize this even more, but this change brings your code's runtime from ~30 seconds to ~2 seconds on my machine.
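
For instance (untested against your exact numbers), the per-column set() call inside numberrepeats is the remaining Python-level loop; one way to replace it is to sort each column and count distinct values with numpy:

import numpy as np

t, iters, N = 30, 1000, 200                           # same roles as in the code above
rand = np.random.randint(0, N, size=(t, iters))       # one simulated run per column
s = np.sort(rand, axis=0)
distinct = 1 + (np.diff(s, axis=0) != 0).sum(axis=0)  # distinct draws in each column
repeats = t - distinct                                # same quantity numberrepeats returns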


A simple (numerical) way to get a confidence interval is to run your script many times and see how much your estimate varies. You can then use that standard deviation to calculate the confidence interval.
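
In sketch form (estimate_mle here is a hypothetical wrapper around your existing script that returns the argmax of one fresh likelihood curve; the stub below just fakes it so the snippet runs):

import numpy as np

def estimate_mle():
    # hypothetical wrapper: rerun your simulation and return the N that
    # maximizes the likelihood curve; this stub just fakes a plausible value
    return 120 + 15 * np.random.randn()

estimates = np.array([estimate_mle() for _ in range(50)])
mean = estimates.mean()
std = estimates.std()
ci = (mean - 1.96 * std, mean + 1.96 * std)   # rough 95% interval, assuming roughly normal spread
print("95%% CI: %.1f to %.1f" % ci)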

In the interest of time, another option is to run a bunch of trials at each value of N (I used 2000), and then use random subsampling of those trials to get an estimate of the estimator standard deviation. Basically, this involves selecting a subset of the trials, generating your likelihood curve using that subset, then finding the maximum of that curve to get your estimator. You do this over many subsets and this gives you a bunch of estimators, which you can use to find a confidence interval on your estimator. My full script is as follows:

import numpy as np

t = 30
k = 3

def trial(N):
    return t - len(np.unique(np.random.randint(0, N, size=t)))

def trials(N, n_trials):
    return np.asarray([trial(N) for i in xrange(n_trials)])

n_trials = 2000
Ns = np.arange(1, 501)
results = np.asarray([trials(N, n_trials=n_trials) for N in Ns])

def likelihood(results):
    L = (results == 3).mean(-1)

    # boxcar filtering
    n = 10
    L = np.convolve(L, np.ones(n) / float(n), mode='same')

    return L

def max_likelihood_estimate(Ns, results):
    i = np.argmax(likelihood(results))
    return Ns[i]

def max_likelihood(Ns, results):
    # calculate mean from all trials
    mean = max_likelihood_estimate(Ns, results)

    # randomly subsample results to estimate std
    n_samples = 100
    sample_frac = 0.25
    estimates = np.zeros(n_samples)
    for i in xrange(n_samples):
        mask = np.random.uniform(size=results.shape[1]) < sample_frac
        estimates[i] = max_likelihood_estimate(Ns, results[:, mask])

    std = estimates.std()
    sterr = std * np.sqrt(sample_frac)  # is this mathematically sound?
    ci = (mean - 1.96*sterr, mean + 1.96*sterr)
    return mean, std, sterr, ci

mean, std, sterr, ci = max_likelihood(Ns, results)
print "Max likelihood estimate: ", mean
print "Max likelihood 95% ci: ", ci

There are two drawbacks to this method. One is that, since you're taking many subsamples from the same set of trials, your estimates are not independent. To limit the effect of this, I only used 25% of the results for each subset. Another drawback is that each subsample is only a fraction of your data, so estimates derived from these subsets will have more variance than estimates derived from running the full script many times. To account for this, I computed the standard error as the standard deviation divided by the square root of 4, since I had four times as much data in my full data set as in one of the subsamples. However, I'm not familiar enough with Monte Carlo theory to know if this is mathematically sound. Running my script a number of times did seem to indicate that my results were reasonable.
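
To spell out that scaling: if the estimator's variance shrinks like 1/n, the full-data standard error should be the subsample standard deviation times sqrt(n_subsample / n_full) = sqrt(sample_frac), which is where the division by sqrt(4) comes from. With a made-up number:

import numpy as np

sample_frac = 0.25                                  # each subsample uses a quarter of the trials
std_subsample = 8.0                                 # made-up std of the subsample estimators
sterr_full = std_subsample * np.sqrt(sample_frac)   # 8.0 / sqrt(4) = 4.0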

Lastly, I did use a boxcar filter on the likelihood curves to smooth them out a bit. Ideally, this should improve results, but even with the filtering there was still a considerable amount of variability in the results. When calculating the value for the overall estimator, I wasn't sure if it would be better to compute one likelihood curve from all the results and use the max of that (this is what I ended up doing), or to use the mean of all the subset estimators. Using the mean of the subset estimators might help cancel out some of the roughness that remains in the curves after filtering, but I'm not sure about this.
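
For what it's worth, trying the other option only requires averaging the subsample estimators that max_likelihood already collects, rather than taking the argmax of the pooled curve; with made-up values:

import numpy as np

estimates = np.array([118., 124., 131., 115., 127.])  # made-up subsample estimators
alternative_mean = estimates.mean()                    # candidate replacement for the pooled argmax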


Here is an answer to your first question and a pointer to a solution for the second:

from pylab import *   # for the bare plot, vlines, fill_between and argmax calls below
import numpy as np

# xdata and ydata are the arrays from your script
plot(xdata, ydata)

# calculate the cumulative distribution function
cdf = np.cumsum(ydata) / sum(ydata)

# get the left and right boundary of the interval that contains 95% of the probability mass
right = argmax(cdf > 0.975)
left = argmax(cdf > 0.025)

# indicate confidence interval with vertical lines
vlines(xdata[left], 0, ydata[left])
vlines(xdata[right], 0, ydata[right])

# hatch confidence interval
fill_between(xdata[left:right], ydata[left:right], facecolor='blue', alpha=0.5)

This produces the following figure: [image omitted: the likelihood curve with the 95% confidence interval marked by vertical lines and shaded in blue]

I'll try to answer question 3 when I have more time :)