How to get the cumulative distribution function with NumPy?


Using a histogram is one solution, but it involves binning the data, which is not necessary for plotting the CDF of empirical data. Let F(x) be the number of entries less than x. F then goes up by one exactly where we see a measurement. Thus, if we sort the samples, increment the count by one at each point (or the fraction by 1/N), and plot one against the other, we see the "exact" (i.e. un-binned) empirical CDF.

The following code sample demonstrates the method:

    import numpy as np
    import matplotlib.pyplot as plt

    N = 100
    Z = np.random.normal(size=N)

    # method 1: binned estimate (density=True replaces the older normed=True)
    H, X1 = np.histogram(Z, bins=10, density=True)
    dx = X1[1] - X1[0]
    F1 = np.cumsum(H) * dx

    # method 2: sort the samples; the i-th sorted sample has i samples below it
    X2 = np.sort(Z)
    F2 = np.arange(N) / float(N)

    plt.plot(X1[1:], F1)
    plt.plot(X2, F2)
    plt.show()

It outputs the following:

[Figure: the two CDF estimates, method 1 (binned) and method 2 (sorted samples), plotted on the same axes]
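
The sorted-sample approach (method 2) can be wrapped in a small helper. This is a minimal sketch rather than part of the answer above: the names `ecdf` and `ecdf_at` are hypothetical, and it assumes the convention that F at a sample is the fraction of samples less than or equal to it, which differs by 1/N from the "less than" convention used in method 2.

```python
import numpy as np


def ecdf(samples):
    """Return the sorted samples and the fraction of samples <= each one."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    # Assumed convention: the i-th sorted sample (1-based) gets F = i / N.
    f = np.arange(1, n + 1) / n
    return x, f


def ecdf_at(samples, points):
    """Evaluate the empirical CDF at arbitrary points (hypothetical helper)."""
    x = np.sort(np.asarray(samples, dtype=float))
    # side="right" counts how many samples are <= each query point.
    return np.searchsorted(x, points, side="right") / x.size


rng = np.random.default_rng(0)
z = rng.normal(size=100)
x, f = ecdf(z)
```

Plotting `x` against `f` (for example with `plt.step(x, f, where="post")`) should give the same un-binned staircase as method 2, and `ecdf_at` may be handy when the CDF is needed at values that are not themselves samples.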


I'm not really sure what your code is doing, but if you have the hist and bin_edges arrays returned by numpy.histogram, you can use numpy.cumsum to generate a cumulative sum of the histogram contents.

    >>> import numpy as np
    >>> hist, bin_edges = np.histogram(np.random.randint(0,10,100), density=True)
    >>> bin_edges
    array([ 0. ,  0.9,  1.8,  2.7,  3.6,  4.5,  5.4,  6.3,  7.2,  8.1,  9. ])
    >>> hist
    array([ 0.14444444,  0.11111111,  0.11111111,  0.1       ,  0.1       ,
            0.14444444,  0.14444444,  0.08888889,  0.03333333,  0.13333333])
    >>> np.cumsum(hist)
    array([ 0.14444444,  0.25555556,  0.36666667,  0.46666667,  0.56666667,
            0.71111111,  0.85555556,  0.94444444,  0.97777778,  1.11111111])


Update for NumPy 1.9.0: user545424's answer does not work in 1.9.0. This works, using density=True and multiplying each density by its bin width before taking the cumulative sum:

    >>> import numpy as np
    >>> arr = np.random.randint(0,10,100)
    >>> hist, bin_edges = np.histogram(arr, density=True)
    >>> hist
    array([ 0.16666667,  0.15555556,  0.15555556,  0.05555556,  0.08888889,
            0.08888889,  0.07777778,  0.04444444,  0.18888889,  0.08888889])
    >>> bin_edges
    array([ 0. ,  0.9,  1.8,  2.7,  3.6,  4.5,  5.4,  6.3,  7.2,  8.1,  9. ])
    >>> np.diff(bin_edges)
    array([ 0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9])
    >>> np.diff(bin_edges)*hist
    array([ 0.15,  0.14,  0.14,  0.05,  0.08,  0.08,  0.07,  0.04,  0.17,  0.08])
    >>> cdf = np.cumsum(hist*np.diff(bin_edges))
    >>> cdf
    array([ 0.15,  0.29,  0.43,  0.48,  0.56,  0.64,  0.71,  0.75,  0.92,  1.  ])
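
An equivalent way to get a binned CDF is to skip the density normalization and divide the cumulative raw counts by the total count, which avoids the bin-width bookkeeping. The following is a rough sketch under that assumption; the function name `binned_cdf` is hypothetical and not part of NumPy.

```python
import numpy as np


def binned_cdf(samples, bins=10):
    """Return right bin edges and the cumulative fraction of samples up to each."""
    # Without density=True, np.histogram returns raw counts per bin.
    counts, bin_edges = np.histogram(samples, bins=bins)
    # Cumulative counts divided by the total give values that end at 1.
    cdf = np.cumsum(counts) / counts.sum()
    return bin_edges[1:], cdf


rng = np.random.default_rng(0)
arr = rng.integers(0, 10, size=100)
edges, cdf = binned_cdf(arr)
```

Because each bin's fraction of samples equals its density times its width, this should match `np.cumsum(hist * np.diff(bin_edges))` from the session above, including for bins of unequal width.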