Add a sequence number to each element in a group using python
I stumbled upon the answer which was embarrassingly simple. The groupby statement has a 'cumcount()' option which will enumerate group items.
df['sequence']=df.groupby('patient').cumcount()
The caveat is that the records have to be in the order you want them enumerated.
Firstly you want to convert the date column to be a pandas datetime (rather than strings):
In [11]: pd.to_datetime(df['date'], format='%d%b%Y')Out[11]:0 2009-06-201 2009-06-242 2009-07-153 2008-02-094 2008-02-215 2010-03-146 2010-05-027 2010-05-12Name: date, dtype: datetime64[ns]
Note: see docs for possible format options.
In [12]: df['date'] = pd.to_datetime(df['date'], format='%d%b%Y')In [13]: dfOut[13]: patient date sequence0 145 2009-06-20 11 145 2009-06-24 22 145 2009-07-15 33 582 2008-02-09 14 582 2008-02-21 25 987 2010-03-14 16 987 2010-05-02 27 987 2010-05-12 3
If this isn't in date order (for each patient), I would sort it first:
In [14]: df = df.sort('date')
Now you can groupby and cumcount:
In [15]: g = df.groupby('patient')In [16]: g.cumcount() + 1Out[16]:2 13 20 11 24 15 26 3dtype: int64
Which is what you want (althout it's out of order):
In [17]: df['sequence'] = g.cumcount() + 1In [18]: dfOut[18]: patient date sequence2 582 2008-02-09 13 582 2008-02-21 20 145 2009-06-24 11 145 2009-07-15 24 987 2010-03-14 15 987 2010-05-02 26 987 2010-05-12 3
To rearrange (though you may not need to) use sort_index
(or we could reindex if we saved the initial DataFrame's index):*
In [19]: df.sort_index()Out[19]: patient date sequence0 145 2009-06-24 11 145 2009-07-15 22 582 2008-02-09 13 582 2008-02-21 24 987 2010-03-14 15 987 2010-05-02 26 987 2010-05-12 3
The question is how do I sort on multiple columns of data.
One simple trick is to use the key
parameter to the sorted function.
You'll be sorting by a string built from the columns of the array.
rows = ...# your source datadef date_to_sortable_string(date): # use datetime package to convert string to sortable date. pass# Assume x[0] === patient_id and x[1] === encounter date# Sort by patient_id and daterows_sorted = sorted(rows, key=lambda x: "%0.5d-%s" % (x[0], date_to_sortable_string(x[1])))for row in rows_sorted: print row