NumPy percentile function different from MATLAB's percentile function


MATLAB apparently uses midpoint interpolation by default. NumPy and R use linear interpolation by default:

    In [182]: np.percentile(x, 75, interpolation='linear')
    Out[182]: 11.312249999999999

    In [183]: np.percentile(x, 75, interpolation='midpoint')
    Out[183]: 11.3165

To understand the difference between linear and midpoint, consider this simple example:

    In [187]: np.percentile([0, 100], 75, interpolation='linear')
    Out[187]: 75.0

    In [188]: np.percentile([0, 100], 75, interpolation='midpoint')
    Out[188]: 50.0
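To make the two rules concrete, here is a minimal pure-Python sketch of how each one can be computed by hand. It assumes the fractional index into the sorted data is p/100 * (n - 1), which reproduces the outputs shown above. The helper name percentile_by_hand is made up for this illustration and is not part of NumPy. (Side note: newer NumPy releases, from 1.22 on, call this keyword method instead of interpolation.)

```python
import math

def percentile_by_hand(values, p, rule="linear"):
    """Hypothetical helper: p-th percentile of `values` under one interpolation rule."""
    y = sorted(values)
    pos = (p / 100) * (len(y) - 1)   # fractional index into the sorted data
    lo, hi = math.floor(pos), math.ceil(pos)
    if rule == "linear":
        # weight the two neighbours by the fractional part of the index
        return y[lo] + (pos - lo) * (y[hi] - y[lo])
    if rule == "midpoint":
        # ignore the fractional part and simply average the two neighbours
        return (y[lo] + y[hi]) / 2
    raise ValueError("unknown rule: " + rule)

print(percentile_by_hand([0, 100], 75, "linear"))    # 75.0
print(percentile_by_hand([0, 100], 75, "midpoint"))  # 50.0
```

With only two data points the index 0.75 falls between them, so linear returns the value three quarters of the way along, while midpoint always returns their average.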

To compile the latest version of NumPy (using Ubuntu):

    mkdir $HOME/src
    cd $HOME/src
    git clone https://github.com/numpy/numpy.git
    cd numpy
    git remote add upstream https://github.com/numpy/numpy.git
    # Read ~/src/numpy/INSTALL.txt
    sudo apt-get install libatlas-base-dev libatlas3gf-base
    python setup.py build --fcompiler=gnu95
    python setup.py install

The advantage of using git instead of pip is that it is super easy to upgrade (or downgrade) to other versions of NumPy (and you get the source code too):

    git fetch upstream
    git checkout master  # or checkout any other version of NumPy
    cd ~/src/numpy
    /bin/rm -rf build
    cdsitepackages    # assuming you are using virtualenv; otherwise cd to your local python sitepackages directory
    /bin/rm -rf numpy numpy-*-py2.7.egg-info
    cd ~/src/numpy
    python setup.py build --fcompiler=gnu95
    python setup.py install


Since the accepted answer is still incomplete even after @cpaulik's comment, I'm posting here what is hopefully a more complete answer (although, for brevity reasons, not perfect, see below).

Using np.percentile(x, p, interpolation='midpoint') is only going to give the same answer as Matlab for very specific values of p, namely when p/100 is a multiple of 1/n, where n is the number of elements in the array. In the original question this was indeed the case, since n=20 and p=75, but in general the two functions differ.

A short emulation of Matlab's prctile function is given by:

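    import numpy as np

    def quantile(x, q):
        # The i-th sorted value (0-based) is treated as the (i + 0.5)/n quantile.
        n = len(x)
        y = np.sort(x)
        return np.interp(q, np.linspace(1 / (2 * n), (2 * n - 1) / (2 * n), n), y)

    def prctile(x, p):
        # Percentiles are quantiles on a 0-100 scale; 100.0 avoids integer division.
        return quantile(x, np.array(p) / 100.0)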

This function, like Matlab's, gives a piecewise linear output spanning from min(x) to max(x). NumPy's percentile function with interpolation='midpoint' instead returns a piecewise constant function ranging from the average of the two smallest elements to the average of the two largest ones. Plotting the two functions for the array in the original question gives the picture in this link (sorry, can't embed it). The dashed red line marks the 75th percentile, where the two functions actually coincide.
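Since the picture is not embedded, a small numeric comparison may help. The sketch below re-implements the emulation above in compact form and compares it with a hand-written midpoint rule (a stand-in for np.percentile with interpolation='midpoint', assuming the fractional index p/100 * (n - 1)). The function names and the four-element sample are made up for illustration.

```python
import numpy as np

def matlab_style_prctile(x, p):
    """Same idea as the prctile emulation: sorted value i sits at quantile (i + 0.5)/n."""
    y = np.sort(np.asarray(x, dtype=float))
    n = y.size
    return np.interp(np.asarray(p) / 100.0, (np.arange(n) + 0.5) / n, y)

def numpy_midpoint(x, p):
    """Hand-written stand-in for NumPy's midpoint rule (an assumption, not NumPy's code)."""
    y = np.sort(np.asarray(x, dtype=float))
    pos = np.asarray(p) / 100.0 * (y.size - 1)   # fractional index into sorted data
    lo = np.floor(pos).astype(int)
    hi = np.ceil(pos).astype(int)
    return (y[lo] + y[hi]) / 2

x = [1.0, 2.0, 4.0, 8.0]   # made-up data, n = 4
for p in (25, 40, 50, 75):
    print(p, matlab_style_prctile(x, p), numpy_midpoint(x, p))
```

For this sample the two columns should agree at p = 25, 50 and 75, where p/100 is a multiple of 1/4, and differ at p = 40, where the Matlab-style value moves linearly between 2 and 4 while the midpoint rule stays at their average.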

P.S. The reason this function is not fully equivalent to Matlab's is that it only accepts a one-dimensional x and raises an error for higher-dimensional input. Matlab's function, on the other hand, accepts a higher-dimensional x and operates along the first non-singleton dimension; implementing that correctly would take a bit longer. However, both this function and Matlab's should work correctly with higher-dimensional inputs for p / q (thanks to np.interp, which takes care of it).
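One possible way to lift the one-dimensional restriction is to apply the 1-D routine along the first axis whose length is greater than one, using np.apply_along_axis. This is only a rough sketch under that assumption: prctile_nd is a hypothetical name, it has not been checked against Matlab for edge cases (empty inputs, NaNs, an explicit dimension argument), and it is not meant to be fast.

```python
import numpy as np

def prctile_1d(x, p):
    """1-D emulation: sorted value i is placed at quantile (i + 0.5)/n."""
    y = np.sort(x)
    n = y.size
    return np.interp(np.asarray(p) / 100.0, (np.arange(n) + 0.5) / n, y)

def prctile_nd(x, p):
    """Hypothetical N-d wrapper: apply prctile_1d along the first axis of length > 1."""
    x = np.asarray(x, dtype=float)
    # first non-singleton axis; fall back to axis 0 if every axis has length 1
    axis = next((i for i, size in enumerate(x.shape) if size > 1), 0)
    return np.apply_along_axis(prctile_1d, axis, x, p)

a = np.array([[0.0, 10.0],
              [100.0, 30.0]])
print(prctile_nd(a, 50))         # one value per column
print(prctile_nd(a, [25, 75]))   # shape (2, 2): one row per percentile
```

For a 2-D array this works column by column, and a row vector of shape (1, n) is treated along its second axis, which is the behaviour described for Matlab above.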