Binning a column with Python Pandas
You can use pandas.cut
:
bins = [0, 1, 5, 10, 25, 50, 100]df['binned'] = pd.cut(df['percentage'], bins)print (df) percentage binned0 46.50 (25, 50]1 44.20 (25, 50]2 100.00 (50, 100]3 42.12 (25, 50]
bins = [0, 1, 5, 10, 25, 50, 100]labels = [1,2,3,4,5,6]df['binned'] = pd.cut(df['percentage'], bins=bins, labels=labels)print (df) percentage binned0 46.50 51 44.20 52 100.00 63 42.12 5
bins = [0, 1, 5, 10, 25, 50, 100]df['binned'] = np.searchsorted(bins, df['percentage'].values)print (df) percentage binned0 46.50 51 44.20 52 100.00 63 42.12 5
...and then value_counts
or groupby
and aggregate size
:
s = pd.cut(df['percentage'], bins=bins).value_counts()print (s)(25, 50] 3(50, 100] 1(10, 25] 0(5, 10] 0(1, 5] 0(0, 1] 0Name: percentage, dtype: int64
s = df.groupby(pd.cut(df['percentage'], bins=bins)).size()print (s)percentage(0, 1] 0(1, 5] 0(5, 10] 0(10, 25] 0(25, 50] 3(50, 100] 1dtype: int64
By default cut
returns categorical
.
Series
methods like Series.value_counts()
will use all categories, even if some categories are not present in the data, operations in categorical.
Using the Numba module for speed up.
On big datasets (more than 500k), pd.cut
can be quite slow for binning data.
I wrote my own function in Numba with just-in-time compilation, which is roughly six times faster:
from numba import njit@njitdef cut(arr): bins = np.empty(arr.shape[0]) for idx, x in enumerate(arr): if (x >= 0) & (x < 1): bins[idx] = 1 elif (x >= 1) & (x < 5): bins[idx] = 2 elif (x >= 5) & (x < 10): bins[idx] = 3 elif (x >= 10) & (x < 25): bins[idx] = 4 elif (x >= 25) & (x < 50): bins[idx] = 5 elif (x >= 50) & (x < 100): bins[idx] = 6 else: bins[idx] = 7 return bins
cut(df['percentage'].to_numpy())# array([5., 5., 7., 5.])
Optional: you can also map it to bins as strings:
a = cut(df['percentage'].to_numpy())conversion_dict = {1: 'bin1', 2: 'bin2', 3: 'bin3', 4: 'bin4', 5: 'bin5', 6: 'bin6', 7: 'bin7'}bins = list(map(conversion_dict.get, a))# ['bin5', 'bin5', 'bin7', 'bin5']
Speed comparison:
# Create a dataframe of 8 million rows for testingdfbig = pd.concat([df]*2000000, ignore_index=True)dfbig.shape# (8000000, 1)
%%timeitcut(dfbig['percentage'].to_numpy())# 38 ms ± 616 µs per loop (mean ± standard deviation of 7 runs, 10 loops each)
%%timeitbins = [0, 1, 5, 10, 25, 50, 100]labels = [1,2,3,4,5,6]pd.cut(dfbig['percentage'], bins=bins, labels=labels)# 215 ms ± 9.76 ms per loop (mean ± standard deviation of 7 runs, 10 loops each)