
Pandas sort by group aggregate and column


Group by A:

In [0]: grp = df.groupby('A')

Within each group, sum over B and broadcast the values using transform. Then sort by B:

In [1]: grp[['B']].transform(sum).sort('B')
Out[1]:
          B
2 -2.829710
5 -2.829710
1  0.253651
4  0.253651
0  0.551377
3  0.551377
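The difference between aggregating and broadcasting with transform is what makes this step work, so here is a minimal sketch of it. It assumes a recent pandas (1.0 or newer), where sort_values replaces sort; df is rebuilt from the values printed in this thread, and the names per_group, broadcast and ordered are only illustrative.

```python
import pandas as pd

# Data copied from the frame printed in this thread.
df = pd.DataFrame({
    "A": ["foo", "bar", "baz", "foo", "bar", "baz"],
    "B": [1.624345, -0.611756, -0.528172, -1.072969, 0.865408, -2.301539],
    "C": [False, True, False, True, False, True],
})

grouped = df.groupby("A")["B"]

# Aggregation collapses each group to a single row (index: bar, baz, foo).
per_group = grouped.sum()

# transform repeats each group's sum on every row of that group,
# so the result shares df's index and can be used to reorder df.
broadcast = grouped.transform("sum")

# mergesort is stable, so rows with equal sums keep their original order.
ordered = broadcast.sort_values(kind="mergesort")
print(ordered.index.tolist())  # [2, 5, 1, 4, 0, 3]
```

Because the broadcast result keeps the original row labels, its sorted index can be passed straight back to df in the next step.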

Index the original df by passing the index from above. This will re-order the A values by the aggregate sum of the B values:

In [2]: sort1 = df.ix[grp[['B']].transform(sum).sort('B').index]

In [3]: sort1
Out[3]:
     A         B      C
2  baz -0.528172  False
5  baz -2.301539   True
1  bar -0.611756   True
4  bar  0.865408  False
0  foo  1.624345  False
3  foo -1.072969   True

Finally, sort the 'C' values within each group of 'A', passing sort=False to groupby so that the group order from the previous step is preserved:

In [4]: f = lambda x: x.sort('C', ascending=False)

In [5]: sort2 = sort1.groupby('A', sort=False).apply(f)

In [6]: sort2
Out[6]:
         A         B      C
A
baz 5  baz -2.301539   True
    2  baz -0.528172  False
bar 1  bar -0.611756   True
    4  bar  0.865408  False
foo 3  foo -1.072969   True
    0  foo  1.624345  False

Clean up the df index by using reset_index with drop=True:

In [7]: sort2.reset_index(0, drop=True)
Out[7]:
     A         B      C
5  baz -2.301539   True
2  baz -0.528172  False
1  bar -0.611756   True
4  bar  0.865408  False
3  foo -1.072969   True
0  foo  1.624345  False
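The session above relies on .ix and DataFrame.sort, which are no longer available in current pandas. The following is a rough sketch of the same steps, assuming pandas 1.0 or newer. Note one swap: instead of groupby(...).apply, it numbers the groups with ngroup() and sorts on that helper column, which avoids the extra index level. The column name group_rank is made up for the example, and df is rebuilt from the values printed above.

```python
import pandas as pd

# Data copied from the frame printed in this thread.
df = pd.DataFrame({
    "A": ["foo", "bar", "baz", "foo", "bar", "baz"],
    "B": [1.624345, -0.611756, -0.528172, -1.072969, 0.865408, -2.301539],
    "C": [False, True, False, True, False, True],
})

# Step 1: row labels ordered by each row's group sum (stable sort).
order = (
    df.groupby("A")["B"].transform("sum").sort_values(kind="mergesort").index
)

# Step 2: .loc reorders rows by label; it stands in for the removed .ix.
sort1 = df.loc[order]

# Step 3: number the groups in order of first appearance (baz=0, bar=1, foo=2),
# then sort by that number and by C descending within each group.
group_rank = sort1.groupby("A", sort=False).ngroup()
sort2 = (
    sort1.assign(group_rank=group_rank)
    .sort_values(["group_rank", "C"], ascending=[True, False])
    .drop(columns="group_rank")
)
print(sort2)
```

Since no apply is involved, the result keeps the original single-level index and no reset_index call is needed.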


Here's a more concise approach...

df['a_bsum'] = df.groupby('A')['B'].transform(sum)
df.sort(['a_bsum','C'], ascending=[True, False]).drop('a_bsum', axis=1)

The first line adds a column to the data frame with the groupwise sum. The second line performs the sort and then removes the extra column.

Result:

     A         B      C
5  baz -2.301539   True
2  baz -0.528172  False
1  bar -0.611756   True
4  bar  0.865408  False
3  foo -1.072969   True
0  foo  1.624345  False

NOTE: DataFrame.sort is deprecated and was removed in later pandas releases; use sort_values instead.
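For reference, the two-line approach might look like this with sort_values. This is a sketch assuming pandas 1.0 or newer, with df rebuilt from the values printed in this thread. It uses assign rather than df['a_bsum'] = ..., so the original frame is left unmodified; that is a choice made for the example, not part of the answer above.

```python
import pandas as pd

# Data copied from the frame printed in this thread.
df = pd.DataFrame({
    "A": ["foo", "bar", "baz", "foo", "bar", "baz"],
    "B": [1.624345, -0.611756, -0.528172, -1.072969, 0.865408, -2.301539],
    "C": [False, True, False, True, False, True],
})

result = (
    # Temporary column holding each row's group-wise sum of B.
    df.assign(a_bsum=df.groupby("A")["B"].transform("sum"))
    # Groups ordered by their sum (ascending); within a group, C descending.
    .sort_values(["a_bsum", "C"], ascending=[True, False])
    # The helper column is only needed for sorting.
    .drop(columns="a_bsum")
)
print(result)
```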


One way to do this is to insert a dummy column with the sums in order to sort:

In [10]: sum_B_over_A = df.groupby('A').sum().B

In [11]: sum_B_over_A
Out[11]:
A
bar    0.253652
baz   -2.829711
foo    0.551376
Name: B

In [12]: df['sum_B_over_A'] = df.A.apply(sum_B_over_A.get_value)

In [13]: df
Out[13]:
     A         B      C  sum_B_over_A
0  foo  1.624345  False      0.551376
1  bar -0.611756   True      0.253652
2  baz -0.528172  False     -2.829711
3  foo -1.072969   True      0.551376
4  bar  0.865408  False      0.253652
5  baz -2.301539   True     -2.829711

In [14]: df.sort(['sum_B_over_A', 'A', 'B'])
Out[14]:
     A         B      C   sum_B_over_A
5  baz -2.301539   True      -2.829711
2  baz -0.528172  False      -2.829711
1  bar -0.611756   True       0.253652
4  bar  0.865408  False       0.253652
3  foo -1.072969   True       0.551376
0  foo  1.624345  False       0.551376

and you may then want to drop the dummy column:

In [15]: df.sort(['sum_B_over_A', 'A', 'B']).drop('sum_B_over_A', axis=1)
Out[15]:
     A         B      C
5  baz -2.301539   True
2  baz -0.528172  False
1  bar -0.611756   True
4  bar  0.865408  False
3  foo -1.072969   True
0  foo  1.624345  False
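Series.get_value and DataFrame.sort are both gone from current pandas, so here is a hedged sketch of the same dummy-column idea, assuming pandas 1.0 or newer. It looks up each row's group sum with Series.map, which is a substitution for the apply(get_value) call above; df is rebuilt from the values printed in this thread, and keyed is just an illustrative name.

```python
import pandas as pd

# Data copied from the frame printed in this thread.
df = pd.DataFrame({
    "A": ["foo", "bar", "baz", "foo", "bar", "baz"],
    "B": [1.624345, -0.611756, -0.528172, -1.072969, 0.865408, -2.301539],
    "C": [False, True, False, True, False, True],
})

# One sum per group, indexed by the values of A.
sum_B_over_A = df.groupby("A")["B"].sum()

# map looks up each row's A label in that Series, giving the dummy column.
keyed = df.assign(sum_B_over_A=df["A"].map(sum_B_over_A))

# Sort by the group sum, then by A and B, and drop the dummy column again.
result = (
    keyed.sort_values(["sum_B_over_A", "A", "B"])
    .drop(columns="sum_B_over_A")
)
print(result)
```

Note that this variant orders rows within a group by B ascending rather than by C descending; for this particular data the two happen to give the same row order.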