Add a sequence number to each element in a group using python

python database pandas count grouping

I stumbled upon the answer, which was embarrassingly simple. The groupby object has a cumcount() method that numbers the rows within each group, starting from 0.

df['sequence']=df.groupby('patient').cumcount()

The caveat is that the records have to be in the order you want them enumerated.
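
A minimal sketch of that caveat, using a small made-up frame (the values are assumptions for illustration, not the asker's data): sorting by date within each patient before calling cumcount() gives chronological numbering, and adding 1 makes the numbering start at 1 rather than 0.

```python
import pandas as pd

# Hypothetical sample data, deliberately out of date order.
df = pd.DataFrame({
    "patient": [145, 582, 145, 582, 145],
    "date": pd.to_datetime(
        ["2009-07-15", "2008-02-21", "2009-06-20", "2008-02-09", "2009-06-24"]
    ),
})

# Sort so each patient's rows are in chronological order, then number them.
df = df.sort_values(["patient", "date"])
df["sequence"] = df.groupby("patient").cumcount() + 1  # cumcount() starts at 0

print(df)
```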

Firstly you want to convert the date column to be a pandas datetime (rather than strings):

In [11]: pd.to_datetime(df['date'], format='%d%b%Y')
Out[11]:
0   2009-06-20
1   2009-06-24
2   2009-07-15
3   2008-02-09
4   2008-02-21
5   2010-03-14
6   2010-05-02
7   2010-05-12
Name: date, dtype: datetime64[ns]

Note: see docs for possible format options.
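
For reference, the format string uses the standard strftime/strptime directives: %d is the day of the month, %b the abbreviated month name, and %Y the four-digit year. A small sketch with made-up date strings (the sample values are assumptions, and month-name parsing assumes an English locale):

```python
import pandas as pd

# Hypothetical input strings in day + abbreviated month + year form.
raw = pd.Series(["20Jun2009", "09Feb2008", "14Mar2010"])

# %d = day of month, %b = abbreviated month name, %Y = four-digit year
dates = pd.to_datetime(raw, format="%d%b%Y")

print(dates.dt.strftime("%Y-%m-%d").tolist())
```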

In [12]: df['date'] = pd.to_datetime(df['date'], format='%d%b%Y')

In [13]: df
Out[13]:
   patient       date  sequence
0      145 2009-06-20         1
1      145 2009-06-24         2
2      145 2009-07-15         3
3      582 2008-02-09         1
4      582 2008-02-21         2
5      987 2010-03-14         1
6      987 2010-05-02         2
7      987 2010-05-12         3

If this isn't in date order (for each patient), I would sort it first:

In [14]: df = df.sort_values('date')

Now you can groupby and cumcount:

In [15]: g = df.groupby('patient')

In [16]: g.cumcount() + 1
Out[16]:
2    1
3    2
0    1
1    2
4    1
5    2
6    3
dtype: int64

This is what you want (although it's out of order):

In [17]: df['sequence'] = g.cumcount() + 1

In [18]: df
Out[18]:
   patient       date  sequence
2      582 2008-02-09         1
3      582 2008-02-21         2
0      145 2009-06-24         1
1      145 2009-07-15         2
4      987 2010-03-14         1
5      987 2010-05-02         2
6      987 2010-05-12         3

To rearrange (though you may not need to), use sort_index (or reindex, if you saved the initial DataFrame's index):

In [19]: df.sort_index()
Out[19]:
   patient       date  sequence
0      145 2009-06-24         1
1      145 2009-07-15         2
2      582 2008-02-09         1
3      582 2008-02-21         2
4      987 2010-03-14         1
5      987 2010-05-02         2
6      987 2010-05-12         3
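
The reindex alternative mentioned above could look something like the sketch below. The frame is a made-up sample, and original_index is a name introduced here purely for illustration.

```python
import pandas as pd

# Hypothetical sample data; rows are not in date order.
df = pd.DataFrame({
    "patient": [145, 145, 582, 582],
    "date": pd.to_datetime(
        ["2009-07-15", "2009-06-24", "2008-02-21", "2008-02-09"]
    ),
})

original_index = df.index  # remember the starting row order

df = df.sort_values("date")
df["sequence"] = df.groupby("patient").cumcount() + 1

# Put the rows back in their original order; sequence numbers stay attached.
df = df.reindex(original_index)

print(df)
```

Because pandas aligns on the index when a Series is assigned to a column, computing the sequence on a sorted copy and assigning it straight back to the unsorted frame should give the same result without reordering the frame at all.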

The underlying question is how to sort on multiple columns of data.

One simple trick is to use the key parameter to the sorted function.

You'll be sorting by a string built from the columns of the array.

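from datetime import datetime

rows = ...  # your source data

def date_to_sortable_string(date):
    # Assumes dates look like '20Jun2009' (the '%d%b%Y' format); adjust as needed.
    # ISO strings ('2009-06-20') sort correctly as plain text.
    return datetime.strptime(date, '%d%b%Y').strftime('%Y-%m-%d')

# Assume x[0] is the patient_id (an integer) and x[1] is the encounter date
# Sort by patient_id, then by date
rows_sorted = sorted(rows, key=lambda x: "%0.5d-%s" % (x[0], date_to_sortable_string(x[1])))

for row in rows_sorted:
    print(row)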
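
An alternative sketch of the same idea sorts on a tuple key rather than a formatted string, which avoids the fixed-width padding, and then numbers each patient's rows with itertools.groupby. The sample rows, the '%d%b%Y' date format, and the helper name parse_date are assumptions for illustration.

```python
from datetime import datetime
from itertools import groupby

# Hypothetical rows of (patient_id, encounter_date_string).
rows = [
    (582, "21Feb2008"),
    (145, "24Jun2009"),
    (582, "09Feb2008"),
    (145, "20Jun2009"),
]

def parse_date(text):
    """Convert a string such as '20Jun2009' into a datetime for comparison."""
    return datetime.strptime(text, "%d%b%Y")

# Tuples compare element by element: patient_id first, then date.
rows_sorted = sorted(rows, key=lambda r: (r[0], parse_date(r[1])))

# Number the rows within each patient, starting at 1.
numbered = [
    (patient, date, seq)
    for patient, group in groupby(rows_sorted, key=lambda r: r[0])
    for seq, (_, date) in enumerate(group, start=1)
]

for row in numbered:
    print(row)
```

Note that itertools.groupby only groups adjacent items, so the sort has to come first.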
