Optimal way to compute pairwise mutual information using numpy
I can't suggest a faster calculation for the outer loop over the n*(n-1)/2 pairs of vectors, but your implementation of calc_MI(x, y, bins) can be simplified if you can use scipy version 0.13 or later, or scikit-learn.
In scipy 0.13, the lambda_ argument was added to scipy.stats.chi2_contingency. This argument controls the statistic that is computed by the function. If you use lambda_="log-likelihood" (or lambda_=0), the log-likelihood ratio is returned. This is also often called the G or G² statistic. Other than a factor of 2*n (where n is the total number of samples in the contingency table), this is the mutual information. So you could implement calc_MI as:
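    import numpy as np
    from scipy.stats import chi2_contingency

    def calc_MI(x, y, bins):
        # 2-D histogram of the two samples: the contingency table of counts
        c_xy = np.histogram2d(x, y, bins)[0]
        # G statistic (log-likelihood ratio) of the table
        g, p, dof, expected = chi2_contingency(c_xy, lambda_="log-likelihood")
        # G = 2 * n * MI, so divide by 2 * n to get the mutual information
        mi = 0.5 * g / c_xy.sum()
        return mi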
The only difference between this and your implementation is that this implementation uses the natural logarithm instead of the base-2 logarithm (so it is expressing the information in "nats" instead of "bits"). If you really prefer bits, just divide mi by log(2).
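As a sanity check on the factor of 2*n, here is a minimal sketch that compares the G-based value with the mutual information computed directly from its definition, and then converts the result to bits. The helper name mi_from_counts and the 3x3 table of counts are made up for illustration; they are not from the question.

```python
import numpy as np
from scipy.stats import chi2_contingency


def mi_from_counts(counts):
    """Mutual information in nats, straight from the definition.

    Hypothetical helper, used here only to cross-check the G-based value.
    """
    p_xy = counts / counts.sum()             # joint probabilities
    p_x = p_xy.sum(axis=1, keepdims=True)    # row marginals, shape (r, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)    # column marginals, shape (1, c)
    indep = p_x * p_y                        # product of marginals, shape (r, c)
    nz = p_xy > 0                            # empty cells contribute nothing
    return float(np.sum(p_xy[nz] * np.log(p_xy[nz] / indep[nz])))


# Illustrative table of counts (assumed data, no empty rows or columns).
counts = np.array([[10.0, 4.0, 2.0],
                   [3.0, 12.0, 5.0],
                   [1.0, 6.0, 14.0]])

g, p, dof, expected = chi2_contingency(counts, lambda_="log-likelihood")
mi_g = 0.5 * g / counts.sum()       # G / (2 * n), in nats
mi_direct = mi_from_counts(counts)  # should match mi_g up to rounding
mi_bits = mi_g / np.log(2)          # the same quantity expressed in bits
```

Two caveats that I believe apply, but check them against your scipy version: chi2_contingency applies Yates' continuity correction by default when the table has a single degree of freedom (a 2x2 table), so with bins=2 you would probably want correction=False; and it raises an error when a row or column of the table is entirely empty, because the expected frequencies then contain zeros.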
If you have (or can install) sklearn (i.e. scikit-learn), you can use sklearn.metrics.mutual_info_score, and implement calc_MI as:
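    import numpy as np
    from sklearn.metrics import mutual_info_score

    def calc_MI(x, y, bins):
        # 2-D histogram of the two samples: the contingency table of counts
        c_xy = np.histogram2d(x, y, bins)[0]
        # The label arguments are ignored when a contingency table is given
        mi = mutual_info_score(None, None, contingency=c_xy)
        return mi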
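For completeness, this is roughly how either version of calc_MI could be plugged into the outer loop over all pairs. It is only a sketch: the name pairwise_MI, the assumption that each variable is a column of a 2-D array, and the random test data are mine, not taken from the question. It does not make the loop itself any faster.

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import mutual_info_score


def calc_MI(x, y, bins):
    # Contingency table of counts from a 2-D histogram of the two samples
    c_xy = np.histogram2d(x, y, bins)[0]
    return mutual_info_score(None, None, contingency=c_xy)


def pairwise_MI(data, bins):
    """Symmetric matrix of mutual information (nats) between columns of data.

    Hypothetical helper; assumes one variable per column.
    """
    n = data.shape[1]
    mi = np.zeros((n, n))  # diagonal is left at 0 (self-information not computed)
    for i, j in combinations(range(n), 2):  # the n*(n-1)/2 unordered pairs
        mi[i, j] = mi[j, i] = calc_MI(data[:, i], data[:, j], bins)
    return mi


# Assumed test data: four variables, with column 1 made to depend on column 0.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 4))
data[:, 1] = data[:, 0] + 0.1 * rng.normal(size=1000)

mi_matrix = pairwise_MI(data, bins=10)
```

With this data the entry for columns 0 and 1 should come out clearly larger than the entries for the independent pairs, which are small but not exactly zero because of the finite sample and the binning.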