Calculate the Cumulative Distribution Function (CDF) in Python

python numpy machine-learning statistics scipy

(It is possible that my interpretation of the question is wrong. If the question is how to get from a discrete PDF to a discrete CDF, then np.cumsum divided by a suitable constant will do if the samples are equispaced. If the samples are not equispaced, then np.cumsum of the array multiplied by the distances between the points will do.)
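As a minimal sketch of the two cumsum variants mentioned in the parenthesis: the standard normal density and both grids below are assumptions chosen only for illustration, and the non-equispaced case weights each sample by the distance to the previous point (a simple right-hand Riemann sum).

```python
import numpy as np

def normal_pdf(x):
    # Standard normal density, used here only as an example PDF.
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

# Equispaced samples: cumulative sum times the constant spacing.
x = np.linspace(-5.0, 5.0, 1001)
dx = x[1] - x[0]
cdf_equi = np.cumsum(normal_pdf(x)) * dx

# Non-equispaced samples: multiply each PDF value by the local spacing first.
x_ne = np.concatenate([np.linspace(-5.0, 0.0, 300, endpoint=False),
                       np.linspace(0.0, 5.0, 900)])
widths = np.diff(x_ne, prepend=x_ne[0])  # the first width is 0
cdf_ne = np.cumsum(normal_pdf(x_ne) * widths)

print(cdf_equi[-1], cdf_ne[-1])  # both should be close to 1
```

Both results are only approximations of the integral; a finer grid or a trapezoidal rule would reduce the error.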

If you have a discrete array of samples and would like to know the CDF of the sample, you can just sort the array. In the sorted result, the smallest value represents 0% and the largest value represents 100%. If you want to know the value at 50% of the distribution, just look at the array element in the middle of the sorted array.

Let us have a closer look at this with a simple example:

    import matplotlib.pyplot as plt
    import numpy as np

    # create some randomly distributed data:
    data = np.random.randn(10000)

    # sort the data:
    data_sorted = np.sort(data)

    # calculate the proportional values of samples
    p = 1. * np.arange(len(data)) / (len(data) - 1)

    # plot the sorted data:
    fig = plt.figure()
    ax1 = fig.add_subplot(121)
    ax1.plot(p, data_sorted)
    ax1.set_xlabel('$p$')
    ax1.set_ylabel('$x$')

    ax2 = fig.add_subplot(122)
    ax2.plot(data_sorted, p)
    ax2.set_xlabel('$x$')
    ax2.set_ylabel('$p$')

    plt.show()

This gives the following plot, where the right-hand plot is the traditional cumulative distribution function. It should reflect the CDF of the process behind the points, but naturally it only approximates that CDF as long as the number of points is finite.

[Figure: left panel, sorted data x against p; right panel, the cumulative distribution function, p against x]
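To illustrate how the estimate approaches the true CDF as the number of points grows, here is a small sketch. It assumes standard normal data (as in the example above), a fixed seed, and a hypothetical helper named max_cdf_error; none of these come from the original answer beyond the definition of p.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def max_cdf_error(n):
    """Largest gap between the sorted-sample estimate and the true normal CDF."""
    data_sorted = np.sort(rng.standard_normal(n))
    p = np.arange(n) / (n - 1)  # same proportional values as in the example above
    return np.max(np.abs(p - stats.norm.cdf(data_sorted)))

errors = {n: max_cdf_error(n) for n in (100, 10_000, 100_000)}
for n, err in errors.items():
    print(f"n = {n:>7d}: max |p - F(x)| = {err:.4f}")
```

The exact numbers depend on the seed, but the gap should shrink roughly like one over the square root of n.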

This function is easy to invert, and it depends on your application which form you need.
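One possible way to use both forms, sketched under the assumption that linear interpolation between samples is acceptable: quantile and cdf below are hypothetical helper names, np.interp gives the inverse (value for a given fraction), and np.searchsorted gives the forward direction (fraction for a given value).

```python
import numpy as np

rng = np.random.default_rng(0)
data_sorted = np.sort(rng.standard_normal(10_000))
p = np.arange(len(data_sorted)) / (len(data_sorted) - 1)

def quantile(q):
    """Inverse CDF: the value below which a fraction q of the samples lies."""
    return np.interp(q, p, data_sorted)

def cdf(x):
    """Forward CDF: the fraction of samples less than or equal to x."""
    return np.searchsorted(data_sorted, x, side="right") / len(data_sorted)

median = quantile(0.5)
print(median, cdf(median))
```

With an even number of samples, quantile(0.5) lands halfway between the two middle elements, which matches np.median.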

Assuming you know how your data is distributed (i.e. you know the PDF of your data), SciPy can evaluate the corresponding CDF at your discrete sample points:

    import numpy as np
    import scipy.stats  # importing scipy alone may not load the stats submodule
    import matplotlib.pyplot as plt
    import seaborn as sns

    x = np.random.randn(10000)  # generate samples from normal distribution (discrete data)
    norm_cdf = scipy.stats.norm.cdf(x)  # calculate the cdf - also discrete

    # plot the cdf
    sns.lineplot(x=x, y=norm_cdf)
    plt.show()

We can even print the first few values of the CDF to show they are discrete:

    print(norm_cdf[:10])
    >>> array([0.39216484, 0.09554546, 0.71268696, 0.5007396 , 0.76484329,
           0.37920836, 0.86010018, 0.9191937 , 0.46374527, 0.4576634 ])

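The same call also accepts arrays with more than one dimension. Note that scipy.stats.norm.cdf is applied elementwise, so for 2d data it returns the univariate (marginal) CDF value of each coordinate, not the joint CDF. We use 2d data below to illustrate: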

    mu = np.zeros(2)  # mean vector
    cov = np.array([[1, 0.6], [0.6, 1]])  # covariance matrix

    # generate 2d normally distributed samples using 0 mean and the covariance matrix above
    x = np.random.multivariate_normal(mean=mu, cov=cov, size=1000)  # 1000 samples
    norm_cdf = scipy.stats.norm.cdf(x)
    print(norm_cdf.shape)
    >>> (1000, 2)
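If the joint CDF is what you need rather than one value per coordinate, recent SciPy versions provide scipy.stats.multivariate_normal.cdf. The sketch below reuses the mean and covariance from the example and evaluates at the origin, a point picked purely for illustration because a closed-form value exists there for a zero-mean bivariate normal.

```python
import numpy as np
from scipy import stats

mu = np.zeros(2)
cov = np.array([[1.0, 0.6], [0.6, 1.0]])
point = np.array([0.0, 0.0])

# Elementwise univariate CDF: one marginal probability per coordinate.
marginals = stats.norm.cdf(point)

# Joint CDF: a single probability P(X1 <= 0, X2 <= 0), computed numerically.
joint = stats.multivariate_normal.cdf(point, mean=mu, cov=cov)

# Closed form at the origin for correlation rho: 1/4 + arcsin(rho) / (2*pi).
rho = 0.6
expected = 0.25 + np.arcsin(rho) / (2.0 * np.pi)

print(marginals, joint, expected)
```

The joint value is obtained by numerical integration, so it agrees with the closed form only up to a small tolerance.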

In the above examples, I had prior knowledge that my data was normally distributed, which is why I used scipy.stats.norm; SciPy supports many other distributions. But again, you need to know how your data is distributed beforehand to use such functions. If you don't know how your data is distributed and just pick an arbitrary distribution to calculate the CDF, you will most likely get incorrect results.
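To make the last point concrete, here is a small sketch that compares a wrong and a matching distributional assumption against the empirical proportions. The exponential data, the seed, and the variable names are assumptions made for this illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.sort(rng.exponential(scale=1.0, size=10_000))

# Empirical proportions of the sorted sample, used as the reference.
empirical = np.arange(1, len(x) + 1) / len(x)

wrong = stats.norm.cdf(x)   # assumes a standard normal, which does not fit
right = stats.expon.cdf(x)  # matches how the data was generated

gap_wrong = np.max(np.abs(empirical - wrong))
gap_right = np.max(np.abs(empirical - right))
print(f"normal assumption:      max gap = {gap_wrong:.3f}")
print(f"exponential assumption: max gap = {gap_right:.3f}")
```

A large gap between the assumed CDF and the empirical proportions is a hint that the chosen distribution does not describe the data.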

The empirical cumulative distribution function is a CDF that jumps exactly at the values in your data set. It is the CDF for a discrete distribution that places a mass at each of your values, where the mass is proportional to the frequency of the value. Since the sum of the masses must be 1, these constraints determine the location and height of each jump in the empirical CDF.

Given an array a of values, you compute the empirical CDF by first obtaining the frequencies of the values. The numpy function unique() is helpful here because it returns not only the frequencies, but also the values in sorted order. To calculate the cumulative distribution, use the cumsum() function, and divide by the total sum. The following function returns the values in sorted order and the corresponding cumulative distribution:

    import numpy as np

    def ecdf(a):
        x, counts = np.unique(a, return_counts=True)
        cusum = np.cumsum(counts)
        return x, cusum / cusum[-1]
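As a quick worked check, the sketch below runs ecdf on the small array used in the example usages and adds a hypothetical helper, ecdf_at, that evaluates the step function at arbitrary points. The helper is not part of the original answer; it assumes the usual right-continuous convention, where the value at t is the fraction of samples less than or equal to t.

```python
import numpy as np

def ecdf(a):
    x, counts = np.unique(a, return_counts=True)
    cusum = np.cumsum(counts)
    return x, cusum / cusum[-1]

def ecdf_at(a, t):
    """Fraction of the values in a that are less than or equal to t."""
    x, y = ecdf(a)
    idx = np.searchsorted(x, t, side="right")  # number of distinct values <= t
    return np.concatenate([[0.0], y])[idx]     # 0 below the smallest value

xvec = np.array([7, 1, 2, 2, 7, 4, 4, 4, 5.5, 7])
x, y = ecdf(xvec)
print(x)  # distinct values: 1, 2, 4, 5.5, 7
print(y)  # cumulative fractions: 0.1, 0.3, 0.6, 0.7, 1.0
print(ecdf_at(xvec, 3))  # 0.3, since three of the ten values are <= 3
```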

To plot the empirical CDF you can use matplotlib's plot() function. The option drawstyle='steps-post' ensures that jumps occur at the right place. However, you need to force a jump at the smallest data value, so it's necessary to insert an additional element in front of x and y.

    import matplotlib.pyplot as plt

    def plot_ecdf(a):
        x, y = ecdf(a)
        x = np.insert(x, 0, x[0])
        y = np.insert(y, 0, 0.)
        plt.plot(x, y, drawstyle='steps-post')
        plt.grid(True)
        plt.savefig('ecdf.png')

Example usages:

    import numpy as np
    import pandas as pd

    xvec = np.array([7,1,2,2,7,4,4,4,5.5,7])
    plot_ecdf(xvec)

    df = pd.DataFrame({'x':[7,1,2,2,7,4,4,4,5.5,7]})
    plot_ecdf(df['x'])

with output:

[Figure: step plot of the empirical CDF of the example data]
