Calculate percentile of value in column

python pandas statistics distribution

To find the percentile of a value relative to an array (or in your case a dataframe column), use the scipy function stats.percentileofscore().

For example, if we have a value x (the other numerical value not in the dataframe), and a reference array, arr (the column from the dataframe), we can find the percentile of x by:

from scipy import stats
percentile = stats.percentileofscore(arr, x)

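Note that stats.percentileofscore() takes a third parameter, kind, that has a significant impact on the resulting percentile. You can choose from 'rank' (the default), 'weak', 'strict', and 'mean'. In short, 'weak' counts values less than or equal to the score, 'strict' counts only values strictly less than it, 'mean' averages those two, and 'rank' averages the percentage rankings of any values tied with the score. See the docs for more information.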

For an example of the difference:

>>> df
   a
0  1
1  2
2  3
3  4
4  5
>>> stats.percentileofscore(df['a'], 4, kind='rank')
80.0
>>> stats.percentileofscore(df['a'], 4, kind='weak')
80.0
>>> stats.percentileofscore(df['a'], 4, kind='strict')
60.0
>>> stats.percentileofscore(df['a'], 4, kind='mean')
70.0
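To make the four kinds concrete, here is a minimal plain-Python sketch of how each one can be computed from two counts. The helper name percentile_of_score_sketch is hypothetical and this is not scipy's source code; it only follows the definitions as I understand them from the docs.

```python
def percentile_of_score_sketch(values, score, kind="rank"):
    """Illustrative helper (hypothetical name), not scipy's implementation."""
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")
    below = sum(1 for v in values if v < score)         # strictly less than score
    at_or_below = sum(1 for v in values if v <= score)  # less than or equal to score
    if kind == "strict":
        return 100.0 * below / n
    if kind == "weak":
        return 100.0 * at_or_below / n
    if kind == "mean":
        # Average of the strict and weak results.
        return 50.0 * (below + at_or_below) / n
    if kind == "rank":
        if at_or_below == below:
            # Score does not occur in values, so there are no ties to average.
            return 100.0 * below / n
        # Tied values occupy ranks below+1 .. at_or_below; average those ranks.
        return 50.0 * (below + 1 + at_or_below) / n
    raise ValueError("kind must be 'rank', 'weak', 'strict' or 'mean'")


column = [1, 2, 3, 4, 5]
results = {
    kind: percentile_of_score_sketch(column, 4, kind)
    for kind in ("rank", "weak", "strict", "mean")
}
print(results)
```

For the column 1 to 5 and the value 4, this reproduces the 80.0, 80.0, 60.0 and 70.0 shown in the session above.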

As a final note, a value that is greater than 80% of the other values in the column is in the 80th percentile, not the 20th (the example above shows how the kind parameter shifts the exact score somewhat). See this Wikipedia article for more information.

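Sort the column and compare the value against the element at the cutoff position (20%, or whichever percentile you need). The function below returns True when the value lies above that cutoff and False otherwise.

For example:

def in_percentile(my_series, val, perc=0.2):
    # Sort the column values in ascending order
    my_list = sorted(my_series.values.tolist())
    n = len(my_list)
    # True if val is greater than the element at the perc cutoff position
    return val > my_list[int(n * perc)]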

Or, if you want the actual percentile, use searchsorted. Note that searchsorted assumes its input is already sorted, so sort the values first:

my_series.sort_values().values.searchsorted(val) / len(my_series) * 100
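The same idea can be sketched without pandas using the standard-library bisect module, which plays the role of searchsorted here. The helper name percentile_by_position is hypothetical. Because bisect_left returns the number of elements strictly less than the value, the result corresponds to the 'strict' kind of percentileofscore.

```python
from bisect import bisect_left


def percentile_by_position(values, val):
    """Percent of values strictly less than val (hypothetical helper)."""
    ordered = sorted(values)  # binary search is only valid on sorted input
    return 100.0 * bisect_left(ordered, val) / len(ordered)


unsorted_column = [3, 1, 5, 2, 4]
result = percentile_by_position(unsorted_column, 4)
print(result)  # 60.0
```

Skipping the sort step on unordered data would make the binary search return an arbitrary position, which is why the values are sorted before searching.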

Since you're looking for values over/under a specific threshold, you could consider using the pandas qcut function. If you want the values under 20% and over 80%, divide your data into 5 equal-sized partitions, so that each partition represents a 20% "chunk" of the data (five 20% partitions make 100%). So, given a DataFrame with one column 'a' holding your data:

df['newcol'] = pd.qcut(df['a'], 5, labels=False)

This adds a new column to your DataFrame in which each row has a value in (0, 1, 2, 3, 4), where 0 marks the lowest 20% of values and 4 marks the highest 20%, i.e. the values above the 80th percentile.
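As a small sketch of how those labels can then be used to pick out the two tails, assuming made-up sample data and a hypothetical column name 'quintile':

```python
import pandas as pd

# Made-up sample data; 'a' stands in for the column of interest.
df = pd.DataFrame({"a": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})

# Label each row with its quintile: 0 = lowest 20%, 4 = highest 20%.
df["quintile"] = pd.qcut(df["a"], 5, labels=False)

lowest = df[df["quintile"] == 0]   # rows in the bottom 20%
highest = df[df["quintile"] == 4]  # rows in the top 20%

print(lowest["a"].tolist())
print(highest["a"].tolist())
```

With heavily repeated values the quantile edges can coincide, in which case qcut raises an error unless you pass duplicates='drop', so it is worth checking the resulting labels on real data.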
