Grouping values using pandas cut
Starting with:
df: val1 val20 NaN 101 10.18 12 25.16 13 44.48 14 85.24 15 36.71 16 77.09 17 81.88 18 22.92 19 44.31 110 15.79 1
and
xml = """<metaGroups> <Groups> <GroupID>age</GroupID> <description>age</description> <groupname> <Name>0 - <10</Name> <min>0</min> <minInc>TRUE</minInc> <max>10</max> <maxInc>FALSE</maxInc> </groupname> <groupname> <Name>10 - <18</Name> <min>10</min> <minInc>TRUE</minInc> <max>18</max> <maxInc>FALSE</maxInc> </groupname> <groupname> <Name>18 - <35</Name> <min>18</min> <minInc>TRUE</minInc> <max>35</max> <maxInc>FALSE</maxInc> </groupname> <groupname> <Name>35 - <50</Name> <min>35</min> <minInc>TRUE</minInc> <max>50</max> <maxInc>FALSE</maxInc> </groupname> <groupname> <Name>50 - <65</Name> <min>50</min> <minInc>TRUE</minInc> <max>65</max> <maxInc>FALSE</maxInc> </groupname> <groupname> <Name>65 - <75</Name> <min>65</min> <minInc>TRUE</minInc> <max>75</max> <maxInc>FALSE</maxInc> </groupname> <groupname> <Name>≥75</Name> <min>75</min> <minInc>TRUE</minInc> </groupname> </Groups></metaGroups>"""
You can use BeautifulSoup
to extract the bin
parameters, construct the labels and apply pd.cut()
:
from bs4 import BeautifulSoup as Soupfrom itertools import chainsoup = Soup(xml, 'html.parser')bins = []for message in soup.findAll('groupname'): min = message.find('min').text try: max = message.find('max').text bins.append([min, max]) except: bins.append([min]) # For max bin
at which point we have
bins[['0', '10'], ['10', '18'], ['18', '35'], ['35', '50'], ['50', '65'], ['65', '75'], ['75']]
Next, we'll flatten the list
of list
, get rid of duplicates and add an upper bound:
labels = binsbins = list(np.unique(np.fromiter(chain.from_iterable(bins), dtype='int')))last = bins[-1]bins.append(int(df.val1.max() + 1))
which yields:
[0, 10, 18, 35, 50, 65, 75, 86]
Constructing the labels:
labels = ['[{0} - {1}]'.format(label[0], label[1]) if len(label) > 1 else '[ > {} ]'.format(label[0]) for label in labels]
and using pd.cut()
:
df['binned'] = pd.cut(df.val1, bins=bins, labels=labels)
produces:
val1 val2 binned1 10.18 1 [10 - 18]2 25.16 1 [18 - 35]3 44.48 1 [35 - 50]4 85.24 1 [>= 75]5 36.71 1 [35 - 50]6 77.09 1 [>= 75]7 81.88 1 [>= 75]8 22.92 1 [18 - 35]9 44.31 1 [35 - 50]10 15.79 1 [10 - 18]