Add a sequence number to each element in a group using python Add a sequence number to each element in a group using python pandas pandas

Add a sequence number to each element in a group using python


I stumbled upon the answer which was embarrassingly simple. The groupby statement has a 'cumcount()' option which will enumerate group items.

df['sequence']=df.groupby('patient').cumcount()

The caveat is that the records have to be in the order you want them enumerated.


Firstly you want to convert the date column to be a pandas datetime (rather than strings):

In [11]: pd.to_datetime(df['date'], format='%d%b%Y')Out[11]:0   2009-06-201   2009-06-242   2009-07-153   2008-02-094   2008-02-215   2010-03-146   2010-05-027   2010-05-12Name: date, dtype: datetime64[ns]

Note: see docs for possible format options.

In [12]: df['date'] = pd.to_datetime(df['date'], format='%d%b%Y')In [13]: dfOut[13]:   patient       date  sequence0      145 2009-06-20         11      145 2009-06-24         22      145 2009-07-15         33      582 2008-02-09         14      582 2008-02-21         25      987 2010-03-14         16      987 2010-05-02         27      987 2010-05-12         3

If this isn't in date order (for each patient), I would sort it first:

In [14]: df = df.sort('date')

Now you can groupby and cumcount:

In [15]: g = df.groupby('patient')In [16]: g.cumcount() + 1Out[16]:2    13    20    11    24    15    26    3dtype: int64

Which is what you want (althout it's out of order):

In [17]: df['sequence'] = g.cumcount() + 1In [18]: dfOut[18]:       patient       date  sequence2      582 2008-02-09         13      582 2008-02-21         20      145 2009-06-24         11      145 2009-07-15         24      987 2010-03-14         15      987 2010-05-02         26      987 2010-05-12         3

To rearrange (though you may not need to) use sort_index (or we could reindex if we saved the initial DataFrame's index):*

In [19]: df.sort_index()Out[19]:   patient       date  sequence0      145 2009-06-24         11      145 2009-07-15         22      582 2008-02-09         13      582 2008-02-21         24      987 2010-03-14         15      987 2010-05-02         26      987 2010-05-12         3


The question is how do I sort on multiple columns of data.

One simple trick is to use the key parameter to the sorted function.

You'll be sorting by a string built from the columns of the array.

rows = ...# your source datadef date_to_sortable_string(date):  # use datetime package to convert string to sortable date.  pass# Assume x[0] === patient_id and x[1] === encounter date# Sort by patient_id and daterows_sorted = sorted(rows, key=lambda x: "%0.5d-%s" % (x[0], date_to_sortable_string(x[1])))for row in rows_sorted:  print row