Simple cross-tabulation in pandas Simple cross-tabulation in pandas python python

Simple cross-tabulation in pandas


v0.21 answer

Use pivot_table with the index parameter:

df.pivot_table(index='category', aggfunc=[len, sum])           len   sum         value valuecategory            AB           2   300AC           1   150AD           1   500

<= v0.12

It is possible to do this using pivot_table for those interested:

In [8]: dfOut[8]:   category  value0       AB    1001       AB    2002       AC    1503       AD    500In [9]: df.pivot_table(rows='category', aggfunc=[len, np.sum])Out[9]:             len    sum          value  valuecategory              AB            2    300AC            1    150AD            1    500

Note that the result's columns are hierarchically indexed. If you had multiple data columns, you would get a result like this:

In [12]: dfOut[12]:   category  value  value20       AB    100       51       AB    200       52       AC    150       53       AD    500       5In [13]: df.pivot_table(rows='category', aggfunc=[len, np.sum])Out[13]:             len            sum                  value  value2  value  value2category                              AB            2       2    300      10AC            1       1    150       5AD            1       1    500       5

The main reason to use __builtin__.sum vs. np.sum is that you get NA-handling from the latter. Probably could intercept the Python built-in, will make a note about that now.


Assuming that you have a file called 2010.csv with contents

category,valueAB,100.00AB,200.00AC,150.00AD,500.00

Then, using the ability to apply multiple aggregation functions following a groupby, you can say:

import pandasdata_2010 = pandas.read_csv("/path/to/2010.csv")data_2010.groupby("category").agg([len, sum])

You should get a result that looks something like

          value                 len  sumcategory            AB            2  300AC            1  150AD            1  500

Note that Wes will likely come by to point out that sum is optimized and that you should probably use np.sum.