
Optimal way to compute pairwise mutual information using numpy


I can't suggest a faster calculation for the outer loop over the n*(n-1)/2 vectors, but your implementation of calc_MI(x, y, bins) can be simplified if you can use scipy version 0.13 or scikit-learn.
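For reference, a minimal sketch of that outer loop might look like the following, assuming the variables are the columns of a 2-D array X (the names pairwise_MI and X are just placeholders) and calc_MI is either of the implementations below:

    import numpy as np

    def pairwise_MI(X, bins):
        # Fill a symmetric matrix with the MI of every pair of columns.
        n_vars = X.shape[1]
        mi_matrix = np.zeros((n_vars, n_vars))
        for i in range(n_vars):
            for j in range(i + 1, n_vars):
                mi = calc_MI(X[:, i], X[:, j], bins)
                mi_matrix[i, j] = mi
                mi_matrix[j, i] = mi
        return mi_matrix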

In scipy 0.13, the lambda_ argument was added to scipy.stats.chi2_contingency. This argument controls the statistic that is computed by the function. If you use lambda_="log-likelihood" (or lambda_=0), the log-likelihood ratio is returned. This is also often called the G or G2 statistic. Other than a factor of 2*n (where n is the total number of samples in the contingency table), this is the mutual information. So you could implement calc_MI as:

    import numpy as np
    from scipy.stats import chi2_contingency

    def calc_MI(x, y, bins):
        # The 2-D histogram of x and y serves as the contingency table.
        c_xy = np.histogram2d(x, y, bins)[0]
        # G-test statistic; G = 2 * n * MI (in nats).
        g, p, dof, expected = chi2_contingency(c_xy, lambda_="log-likelihood")
        mi = 0.5 * g / c_xy.sum()
        return mi

The only difference between this and your implementation is that this implementation uses the natural logarithm instead of the base-2 logarithm (so it is expressing the information in "nats" instead of "bits"). If you really prefer bits, just divide mi by log(2).
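For example, if mi_nats is the value returned by calc_MI, the conversion is just:

    import numpy as np

    mi_bits = mi_nats / np.log(2)  # nats -> bits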

If you have (or can install) sklearn (i.e. scikit-learn), you can use sklearn.metrics.mutual_info_score and implement calc_MI as:

    import numpy as np
    from sklearn.metrics import mutual_info_score

    def calc_MI(x, y, bins):
        # The 2-D histogram of x and y serves as the contingency table.
        c_xy = np.histogram2d(x, y, bins)[0]
        # Pass the contingency table directly; label arrays are not needed.
        mi = mutual_info_score(None, None, contingency=c_xy)
        return mi
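As a quick sanity check (the data here is made up purely for illustration), either version is called the same way, and both return the mutual information in nats because they operate on the same 2-D histogram:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=1000)
    y = x + rng.normal(size=1000)   # correlated with x, so MI > 0
    print(calc_MI(x, y, bins=10))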