Creating a mixture of probability distributions for sampling Creating a mixture of probability distributions for sampling numpy numpy

Creating a mixture of probability distributions for sampling


Sampling from a mixture of distributions (where PDFs are added with some coefficients c_1, c_2, ... c_n) is equivalent to sampling each independently, and then, for each index, picking the value from k-th sample, with probability c_k.

The latter, mixing, step can be efficiently done with numpy.random.choice. Here is an example where three distributions are mixed. The distributions are listed in distributions, and their coefficients in coefficients. There is a fat normal distribution, a uniform distribution, and a narrow normal distribution, with coefficients 0.5, 0.2, 0.3. The mixing happens at data[np.arange(sample_size), random_idx] after random_idx are generated according to given coefficients.

import numpy as npimport matplotlib.pyplot as pltdistributions = [    {"type": np.random.normal, "kwargs": {"loc": -3, "scale": 2}},    {"type": np.random.uniform, "kwargs": {"low": 4, "high": 6}},    {"type": np.random.normal, "kwargs": {"loc": 2, "scale": 1}},]coefficients = np.array([0.5, 0.2, 0.3])coefficients /= coefficients.sum()      # in case these did not add up to 1sample_size = 100000num_distr = len(distributions)data = np.zeros((sample_size, num_distr))for idx, distr in enumerate(distributions):    data[:, idx] = distr["type"](size=(sample_size,), **distr["kwargs"])random_idx = np.random.choice(np.arange(num_distr), size=(sample_size,), p=coefficients)sample = data[np.arange(sample_size), random_idx]plt.hist(sample, bins=100, density=True)plt.show()

histogram


Following @PaulPanzer's pointer in the comments, I created the following subclass for easily creating mixture models from the SciPy distributions. Note, the pdf is not required for my question, but it was nice for me to have.

class MixtureModel(rv_continuous):    def __init__(self, submodels, *args, **kwargs):        super().__init__(*args, **kwargs)        self.submodels = submodels    def _pdf(self, x):        pdf = self.submodels[0].pdf(x)        for submodel in self.submodels[1:]:            pdf += submodel.pdf(x)        pdf /= len(self.submodels)        return pdf    def rvs(self, size):        submodel_choices = np.random.randint(len(self.submodels), size=size)        submodel_samples = [submodel.rvs(size=size) for submodel in self.submodels]        rvs = np.choose(submodel_choices, submodel_samples)        return rvsmixture_gaussian_model = MixtureModel([norm(-3, 1), norm(3, 1)])x_axis = np.arange(-6, 6, 0.001)mixture_pdf = mixture_gaussian_model.pdf(x_axis)mixture_rvs = mixture_gaussian_model.rvs(10)


The code below stores a 1000 samples from N(0,1) and 500 samples from N(7,2) in an array that can then be sampled from.

import numpy as npfrom scipy import statsd = np.concatenate((stats.norm.rvs(0.0, 1.0, 1000), stats.norm.rvs(7.0, 2.0, 500)))np.random.choice(d, 3)  # sample 3 observations

Mixture components other than the Normal can be used (e.g., stats.poisson) and there can be arbitrary number of those.