Grouping values using pandas cut Grouping values using pandas cut pandas pandas

Grouping values using pandas cut


Starting with:

df:     val1  val20     NaN    101   10.18     12   25.16     13   44.48     14   85.24     15   36.71     16   77.09     17   81.88     18   22.92     19   44.31     110  15.79     1

and

xml = """<metaGroups>    <Groups>        <GroupID>age</GroupID>        <description>age</description>        <groupname>            <Name>0 - <10</Name>            <min>0</min>            <minInc>TRUE</minInc>            <max>10</max>            <maxInc>FALSE</maxInc>        </groupname>        <groupname>            <Name>10 - <18</Name>            <min>10</min>            <minInc>TRUE</minInc>            <max>18</max>            <maxInc>FALSE</maxInc>        </groupname>        <groupname>            <Name>18 - <35</Name>            <min>18</min>            <minInc>TRUE</minInc>            <max>35</max>            <maxInc>FALSE</maxInc>        </groupname>        <groupname>            <Name>35 - <50</Name>            <min>35</min>            <minInc>TRUE</minInc>            <max>50</max>            <maxInc>FALSE</maxInc>        </groupname>        <groupname>            <Name>50 - <65</Name>            <min>50</min>            <minInc>TRUE</minInc>            <max>65</max>            <maxInc>FALSE</maxInc>        </groupname>        <groupname>            <Name>65 - <75</Name>            <min>65</min>            <minInc>TRUE</minInc>            <max>75</max>            <maxInc>FALSE</maxInc>        </groupname>        <groupname>            <Name>&ge;75</Name>            <min>75</min>            <minInc>TRUE</minInc>        </groupname>    </Groups></metaGroups>"""

You can use BeautifulSoup to extract the bin parameters, construct the labels and apply pd.cut():

from bs4 import BeautifulSoup as Soupfrom itertools import chainsoup = Soup(xml, 'html.parser')bins = []for message in soup.findAll('groupname'):    min = message.find('min').text    try:        max = message.find('max').text        bins.append([min, max])    except:        bins.append([min]) # For max bin

at which point we have

bins[['0', '10'], ['10', '18'], ['18', '35'], ['35', '50'], ['50', '65'], ['65', '75'], ['75']]

Next, we'll flatten the list of list, get rid of duplicates and add an upper bound:

labels = binsbins = list(np.unique(np.fromiter(chain.from_iterable(bins), dtype='int')))last = bins[-1]bins.append(int(df.val1.max() + 1))

which yields:

[0, 10, 18, 35, 50, 65, 75, 86]

Constructing the labels:

labels = ['[{0} - {1}]'.format(label[0], label[1]) if len(label) > 1 else '[ > {} ]'.format(label[0]) for label in labels]

and using pd.cut():

df['binned'] = pd.cut(df.val1, bins=bins, labels=labels)

produces:

     val1  val2     binned1   10.18     1  [10 - 18]2   25.16     1  [18 - 35]3   44.48     1  [35 - 50]4   85.24     1    [>= 75]5   36.71     1  [35 - 50]6   77.09     1    [>= 75]7   81.88     1    [>= 75]8   22.92     1  [18 - 35]9   44.31     1  [35 - 50]10  15.79     1  [10 - 18]