Simple cross-tabulation in pandas
v0.21
answer
Use pivot_table
with the index
parameter:
df.pivot_table(index='category', aggfunc=[len, sum]) len sum value valuecategory AB 2 300AC 1 150AD 1 500
<= v0.12
It is possible to do this using pivot_table
for those interested:
In [8]: dfOut[8]: category value0 AB 1001 AB 2002 AC 1503 AD 500In [9]: df.pivot_table(rows='category', aggfunc=[len, np.sum])Out[9]: len sum value valuecategory AB 2 300AC 1 150AD 1 500
Note that the result's columns are hierarchically indexed. If you had multiple data columns, you would get a result like this:
In [12]: dfOut[12]: category value value20 AB 100 51 AB 200 52 AC 150 53 AD 500 5In [13]: df.pivot_table(rows='category', aggfunc=[len, np.sum])Out[13]: len sum value value2 value value2category AB 2 2 300 10AC 1 1 150 5AD 1 1 500 5
The main reason to use __builtin__.sum
vs. np.sum
is that you get NA-handling from the latter. Probably could intercept the Python built-in, will make a note about that now.
Assuming that you have a file called 2010.csv with contents
category,valueAB,100.00AB,200.00AC,150.00AD,500.00
Then, using the ability to apply multiple aggregation functions following a groupby, you can say:
import pandasdata_2010 = pandas.read_csv("/path/to/2010.csv")data_2010.groupby("category").agg([len, sum])
You should get a result that looks something like
value len sumcategory AB 2 300AC 1 150AD 1 500
Note that Wes will likely come by to point out that sum is optimized and that you should probably use np.sum.