Pandas sort by group aggregate and column
Group by A:
In [0]: grp = df.groupby('A')
Within each group, sum over B and broadcast the values using transform. Then sort by B:
In [1]: grp[['B']].transform(sum).sort('B')
Out[1]:
          B
2 -2.829710
5 -2.829710
1  0.253651
4  0.253651
0  0.551377
3  0.551377
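As a rough illustration of the broadcast step in current pandas, the sketch below rebuilds the example frame from the values shown in this thread and uses sort_values in place of the older sort. The variable names group_sum and ordered are only illustrative.

```python
import pandas as pd

# Example frame rebuilt from the values printed in this thread.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'baz', 'foo', 'bar', 'baz'],
    'B': [1.624345, -0.611756, -0.528172, -1.072969, 0.865408, -2.301539],
    'C': [False, True, False, True, False, True],
})

# transform('sum') returns one value per original row:
# every row receives the sum of B for its own group in A.
group_sum = df.groupby('A')['B'].transform('sum')

# Sorting the broadcast sums orders the row labels by group total.
ordered = group_sum.sort_values()
print(ordered)
```

Because the result keeps the original row labels, its index can be used to reorder df directly.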
Index the original df by passing the index from above. This will re-order the A values by the aggregate sum of the B values:
In [2]: sort1 = df.ix[grp[['B']].transform(sum).sort('B').index]

In [3]: sort1
Out[3]:
     A         B      C
2  baz -0.528172  False
5  baz -2.301539   True
1  bar -0.611756   True
4  bar  0.865408  False
0  foo  1.624345  False
3  foo -1.072969   True
Finally, sort the 'C' values within groups of 'A', using the sort=False option to preserve the A sort order from step 1:
In [4]: f = lambda x: x.sort('C', ascending=False)

In [5]: sort2 = sort1.groupby('A', sort=False).apply(f)

In [6]: sort2
Out[6]:
         A         B      C
A
baz 5  baz -2.301539   True
    2  baz -0.528172  False
bar 1  bar -0.611756   True
    4  bar  0.865408  False
foo 3  foo -1.072969   True
    0  foo  1.624345  False
Clean up the df index by using reset_index with drop=True:
In [7]: sort2.reset_index(0, drop=True)
Out[7]:
     A         B      C
5  baz -2.301539   True
2  baz -0.528172  False
1  bar -0.611756   True
4  bar  0.865408  False
3  foo -1.072969   True
0  foo  1.624345  False
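The session above relies on sort and .ix, which newer pandas releases no longer provide. The sketch below is one possible way to get the same final ordering today. It is not a line-for-line port: instead of the groupby/apply step it uses two stable sorts (C first, then the group sum), and the names by_c and result are only illustrative.

```python
import pandas as pd

# Example frame rebuilt from the values printed in this thread.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'baz', 'foo', 'bar', 'baz'],
    'B': [1.624345, -0.611756, -0.528172, -1.072969, 0.865408, -2.301539],
    'C': [False, True, False, True, False, True],
})

# Broadcast each group's sum of B back to its rows.
group_sum = df.groupby('A')['B'].transform('sum')

# Secondary key first: order rows by C, True before False.
by_c = df.sort_values('C', ascending=False, kind='mergesort')

# Primary key second: a stable sort (mergesort) on the group sums
# keeps the C ordering inside each group of equal sums.
order = group_sum.loc[by_c.index].sort_values(kind='mergesort').index

# .loc takes the place of the older .ix for label-based reordering.
result = df.loc[order]
print(result)
```

Since the original row labels are kept throughout, no reset_index step is needed here.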
Here's a more concise approach...
df['a_bsum'] = df.groupby('A')['B'].transform(sum)
df.sort(['a_bsum','C'], ascending=[True, False]).drop('a_bsum', axis=1)
The first line adds a column to the data frame with the groupwise sum. The second line performs the sort and then removes the extra column.
Result:
     A         B      C
5  baz -2.301539   True
2  baz -0.528172  False
1  bar -0.611756   True
4  bar  0.865408  False
3  foo -1.072969   True
0  foo  1.624345  False
NOTE: sort is deprecated; use sort_values instead.
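For reference, the same two-line idea written with sort_values might look like the sketch below. It rebuilds the example frame from the values shown in this thread and uses assign so that the helper column a_bsum never has to be added to df itself; that is a stylistic choice, not part of the original answer.

```python
import pandas as pd

# Example frame rebuilt from the values printed in this thread.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'baz', 'foo', 'bar', 'baz'],
    'B': [1.624345, -0.611756, -0.528172, -1.072969, 0.865408, -2.301539],
    'C': [False, True, False, True, False, True],
})

result = (
    df.assign(a_bsum=df.groupby('A')['B'].transform('sum'))   # groupwise sum per row
      .sort_values(['a_bsum', 'C'], ascending=[True, False])  # sum ascending, C descending
      .drop(columns='a_bsum')                                 # remove the helper column
)
print(result)
```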
One way to do this is to insert a dummy column with the sums in order to sort:
In [10]: sum_B_over_A = df.groupby('A').sum().B

In [11]: sum_B_over_A
Out[11]:
A
bar    0.253652
baz   -2.829711
foo    0.551376
Name: B

In [12]: df['sum_B_over_A'] = df.A.apply(sum_B_over_A.get_value)

In [13]: df
Out[13]:
     A         B      C  sum_B_over_A
0  foo  1.624345  False      0.551376
1  bar -0.611756   True      0.253652
2  baz -0.528172  False     -2.829711
3  foo -1.072969   True      0.551376
4  bar  0.865408  False      0.253652
5  baz -2.301539   True     -2.829711

In [14]: df.sort(['sum_B_over_A', 'A', 'B'])
Out[14]:
     A         B      C  sum_B_over_A
5  baz -2.301539   True     -2.829711
2  baz -0.528172  False     -2.829711
1  bar -0.611756   True      0.253652
4  bar  0.865408  False      0.253652
3  foo -1.072969   True      0.551376
0  foo  1.624345  False      0.551376
You would then probably drop the dummy column:
In [15]: df.sort(['sum_B_over_A', 'A', 'B']).drop('sum_B_over_A', axis=1)
Out[15]:
     A         B      C
5  baz -2.301539   True
2  baz -0.528172  False
1  bar -0.611756   True
4  bar  0.865408  False
3  foo -1.072969   True
0  foo  1.624345  False
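Series.get_value and DataFrame.sort are not available in recent pandas, so the dummy-column approach might be sketched as follows, with map doing the per-row lookup and sort_values doing the ordering. The frame is rebuilt from the values shown in this thread, and the use of map and assign is a substitution for illustration, not the original answer's code.

```python
import pandas as pd

# Example frame rebuilt from the values printed in this thread.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'baz', 'foo', 'bar', 'baz'],
    'B': [1.624345, -0.611756, -0.528172, -1.072969, 0.865408, -2.301539],
    'C': [False, True, False, True, False, True],
})

# One sum of B per distinct value of A (a Series indexed by A).
sum_B_over_A = df.groupby('A')['B'].sum()

result = (
    df.assign(sum_B_over_A=df['A'].map(sum_B_over_A))  # look up each row's group sum
      .sort_values(['sum_B_over_A', 'A', 'B'])         # group sum, then A, then B
      .drop(columns='sum_B_over_A')                    # drop the dummy column
)
print(result)
```

Note that this variant orders rows within each group by B ascending rather than by C.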