How do I get the percentile for a row in a pandas dataframe?
TL;DR
Use
sz = temp['INCOME'].size - 1
temp['PCNT_LIN'] = temp['INCOME'].rank(method='max').apply(lambda x: 100.0*(x-1)/sz)

   INCOME    PCNT_LIN
0      78   44.444444
1      38   11.111111
2      42   22.222222
3      48   33.333333
4      31    0.000000
5      89   55.555556
6      94   66.666667
7     102   77.777778
8     122  100.000000
9     122  100.000000
Answer
It is actually very simple once you understand the mechanics. When you are looking for the percentile of a score, you already have the scores in each row. The only step left is understanding that you need the percentage of values that are less than or equal to the selected value. This is exactly what kind='weak' in scipy.stats.percentileofscore() and method='max' (together with pct=True) in Series.rank() compute. To invert it, run Series.quantile() with interpolation='lower'.
So the behavior of scipy.stats.percentileofscore(), Series.rank() and Series.quantile() is consistent, see below:
In[]:
import pandas as pd
import scipy.stats

temp = pd.DataFrame([78, 38, 42, 48, 31, 89, 94, 102, 122, 122], columns=['INCOME'])
# Share of values <= the row's value, on a 0..1 scale
temp['PCNT_RANK'] = temp['INCOME'].rank(method='max', pct=True)
# The same quantity from SciPy, on a 0..100 scale
temp['POF'] = temp['INCOME'].apply(lambda x: scipy.stats.percentileofscore(temp['INCOME'], x, kind='weak'))
# Inverse of PCNT_RANK
temp['QUANTILE_VALUE'] = temp['PCNT_RANK'].apply(lambda x: temp['INCOME'].quantile(x, interpolation='lower'))
temp['RANK'] = temp['INCOME'].rank(method='max')
sz = temp['RANK'].size - 1
# "Interpolated" percentile, matching quantile() with linear interpolation
temp['PCNT_LIN'] = temp['RANK'].apply(lambda x: (x-1)/sz)
temp['CHK'] = temp['PCNT_LIN'].apply(lambda x: temp['INCOME'].quantile(x))
temp

Out[]:
   INCOME  PCNT_RANK    POF  QUANTILE_VALUE  RANK  PCNT_LIN    CHK
0      78        0.5   50.0              78   5.0  0.444444   78.0
1      38        0.2   20.0              38   2.0  0.111111   38.0
2      42        0.3   30.0              42   3.0  0.222222   42.0
3      48        0.4   40.0              48   4.0  0.333333   48.0
4      31        0.1   10.0              31   1.0  0.000000   31.0
5      89        0.6   60.0              89   6.0  0.555556   89.0
6      94        0.7   70.0              94   7.0  0.666667   94.0
7     102        0.8   80.0             102   8.0  0.777778  102.0
8     122        1.0  100.0             122  10.0  1.000000  122.0
9     122        1.0  100.0             122  10.0  1.000000  122.0
Now the column PCNT_RANK holds the ratio of values that are smaller than or equal to the one in the column INCOME. If you want the "interpolated" ratio instead, it is in the column PCNT_LIN. And because the calculation uses Series.rank(), it is pretty fast and will crunch 255M numbers in seconds.
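The same two percentile columns can also be computed without `.apply`, which keeps the whole calculation vectorized. This is only a minimal sketch, assuming the same ten-value INCOME data as above; the names `pcnt_rank`, `pcnt_lin` and `weak` are illustrative, and the direct count is there purely as a cross-check.

```python
import numpy as np
import pandas as pd

income = pd.Series([78, 38, 42, 48, 31, 89, 94, 102, 122, 122], name="INCOME")

# Share of values <= each value (the kind='weak' idea), on a 0..1 scale.
pcnt_rank = income.rank(method="max", pct=True)

# "Interpolated" percentile: (rank - 1) / (n - 1), computed on the whole Series at once.
pcnt_lin = (income.rank(method="max") - 1) / (income.size - 1)

# Cross-check pcnt_rank by counting directly with NumPy broadcasting:
# entry [i, j] is True when values[j] <= values[i].
values = income.to_numpy()
weak = (values[None, :] <= values[:, None]).mean(axis=1)

print(pd.DataFrame({"INCOME": income, "PCNT_RANK": pcnt_rank, "PCNT_LIN": pcnt_lin}))
```

The broadcasting check is quadratic in memory, so it is only suitable for small inputs; the `rank()` lines are the part meant for large data.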
Here I will explain how you get the value when using quantile() with linear interpolation:
temp['INCOME'].quantile(0.11)
37.93
Our data temp['INCOME'] has only ten values. According to the formula from your link to Wiki, the rank of the 11th percentile is
rank = 11*(10-1)/100 + 1 = 1.99
The integer part of the rank is 1, which corresponds to the value 31, and the value with rank 2 (i.e. the next bin) is 38. The fraction is the fractional part of the rank, here 0.99. This leads to the result:
31 + (38-31)*(0.99) = 37.93
For the values themselves, the fraction part has to be zero, so it is very easy to do the inverse calculation to get the percentile:
p = (rank - 1)*100/(10 - 1)
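The arithmetic above can be checked with a short sketch. The helper names `linear_quantile` and `percentile_of_rank` are hypothetical, written only to mirror the two formulas; the manual result is compared against Series.quantile() with its default linear interpolation.

```python
import math
import pandas as pd

income = pd.Series([78, 38, 42, 48, 31, 89, 94, 102, 122, 122])
data = sorted(income)
n = len(data)

def linear_quantile(p):
    """Value at percentile p (0..100) using rank = p*(n-1)/100 + 1."""
    rank = p * (n - 1) / 100 + 1          # 1-based, possibly fractional
    lower = math.floor(rank)              # integer part of the rank
    fraction = rank - lower               # fractional part of the rank
    upper = min(lower + 1, n)             # next rank, clamped at the last value
    return data[lower - 1] + (data[upper - 1] - data[lower - 1]) * fraction

def percentile_of_rank(rank):
    """Inverse for values present in the data, where the fraction is zero."""
    return (rank - 1) * 100 / (n - 1)

manual = linear_quantile(11)              # 31 + (38 - 31) * 0.99
builtin = income.quantile(0.11)
print(manual, builtin)
print(percentile_of_rank(2))              # percentile of the value 38
```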
I hope I made it more clear.
This seems to work:
A = np.sort(temp['INCOME'].values)
np.interp(sample, A, np.linspace(0, 1, len(A)))
For example:
>>> temp.INCOME.quantile(np.interp([37.5, 38, 122, 121], A, np.linspace(0, 1, len(A))))
0.103175     37.5
0.111111     38.0
1.000000    122.0
0.883333    121.0
Name: INCOME, dtype: float64
Please note that this strategy only makes sense if you want to query a large enough number of values. Otherwise the sorting is too expensive.
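One way to package this is sketched below, under the assumption that the reference data stays fixed so the sort is done once and reused for every query. `make_percentile_lookup` is a hypothetical helper name, not a pandas or NumPy function.

```python
import numpy as np
import pandas as pd

def make_percentile_lookup(series):
    """Sort once, then return a function mapping values to fractions in [0, 1]."""
    sorted_vals = np.sort(series.to_numpy())
    grid = np.linspace(0, 1, len(sorted_vals))

    def lookup(sample):
        # Linear interpolation between neighbouring sorted values.
        return np.interp(sample, sorted_vals, grid)

    return lookup

income = pd.Series([78, 38, 42, 48, 31, 89, 94, 102, 122, 122], name="INCOME")
lookup = make_percentile_lookup(income)

queries = [37.5, 38, 121]
fractions = lookup(queries)
# Round trip: feeding the fractions back into quantile() should recover the queries.
recovered = income.quantile(fractions).to_numpy()
print(fractions, recovered)
```

The tied value 122 is left out of the queries here only to keep the round-trip check simple.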
Let's consider the below dataframe:
To get the share of each category in a column of a pandas DataFrame (a relative frequency rather than a rank-based percentile), use the following code:
survey['Nationality'].value_counts(normalize=True)
Output:
USA 0.333333
China 0.250000
India 0.250000
Bangadesh 0.166667
Name: Nationality, dtype: float64
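Since the survey dataframe itself is not shown, here is a minimal sketch with a hypothetical twelve-row Nationality column whose counts were chosen to be consistent with the proportions above. The data is made up for illustration; note that `normalize` in value_counts() is a boolean flag.

```python
import pandas as pd

# Hypothetical data: 4 + 3 + 3 + 2 = 12 respondents.
survey = pd.DataFrame({
    "Nationality": ["USA"] * 4 + ["China"] * 3 + ["India"] * 3 + ["Bangladesh"] * 2
})

# Relative frequency of each category (fractions that sum to 1).
shares = survey["Nationality"].value_counts(normalize=True)
print(shares)
```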
To get the shares of a column with respect to another categorical column (normalized within each row), use:
pd.crosstab(survey.Sex,survey.Handedness,normalize = 'index')
The output would look something like this: