Sum rows where value is equal in a column (numpy)


Pandas has a very powerful groupby function which makes this simple.

import numpy as np
import pandas as pd

n = np.array([[1, 2, 3],
              [1, 4, 6],
              [2, 3, 5],
              [2, 6, 2],
              [3, 4, 8]])
df = pd.DataFrame(n, columns=["First Col", "Second Col", "Third Col"])
df.groupby("First Col").sum()
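If you need a plain NumPy array back rather than a DataFrame, the grouped result can be converted; a sketch using the same sample data (`reset_index` is needed because `groupby` moves the key into the index):

```python
import numpy as np
import pandas as pd

n = np.array([[1, 2, 3],
              [1, 4, 6],
              [2, 3, 5],
              [2, 6, 2],
              [3, 4, 8]])
df = pd.DataFrame(n, columns=["First Col", "Second Col", "Third Col"])

# groupby("First Col") puts the key column into the index;
# reset_index() turns it back into a regular column before exporting
out = df.groupby("First Col").sum().reset_index().to_numpy()
print(out)
```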


Approach #1

Here's a vectorized NumPythonic way based on np.bincount -

# Initial setup
N = A.shape[1] - 1
unqA1, ids = np.unique(A[:, 0], return_inverse=True)

# Create subscripts and accumulate with bincount for tagged summations
subs = np.arange(N)*(ids.max() + 1) + ids[:, None]
sums = np.bincount(subs.ravel(), weights=A[:, 1:].ravel())

# Append the unique elements from first column to get final output
out = np.append(unqA1[:, None], sums.reshape(N, -1).T, 1)
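Packaged as a self-contained function (the name `bincount_sum` is mine), the steps above run end-to-end on the sample input:

```python
import numpy as np

def bincount_sum(A):
    # Number of value columns (everything after the key column)
    N = A.shape[1] - 1
    unqA1, ids = np.unique(A[:, 0], return_inverse=True)
    # One bincount bin per (value column, group) pair
    subs = np.arange(N)*(ids.max() + 1) + ids[:, None]
    sums = np.bincount(subs.ravel(), weights=A[:, 1:].ravel())
    # Prepend the unique keys as the first column
    return np.append(unqA1[:, None], sums.reshape(N, -1).T, 1)

A = np.array([[1, 2, 3],
              [1, 4, 6],
              [2, 3, 5],
              [2, 6, 2],
              [7, 2, 1],
              [2, 0, 3]])
print(bincount_sum(A))
```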

Sample input, output -

In [66]: A
Out[66]:
array([[1, 2, 3],
       [1, 4, 6],
       [2, 3, 5],
       [2, 6, 2],
       [7, 2, 1],
       [2, 0, 3]])

In [67]: out
Out[67]:
array([[  1.,   6.,   9.],
       [  2.,   9.,  10.],
       [  7.,   2.,   1.]])

Approach #2

Here's another based on np.cumsum and np.diff -

# Sort A based on first column
sA = A[np.argsort(A[:, 0]), :]

# Row mask of where each group ends
row_mask = np.append(np.diff(sA[:, 0]) != 0, [True])

# Get cumulative summations and then diff to get summations for each group
cumsum_grps = sA.cumsum(0)[row_mask, 1:]
sum_grps = np.diff(cumsum_grps, axis=0)

# Concatenate the first group's sums with the rest
counts = np.concatenate((cumsum_grps[0, :][None], sum_grps), axis=0)

# Concatenate the first column of the input array for final output
out = np.concatenate((sA[row_mask, 0][:, None], counts), axis=1)
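Wrapped as a function (the name `cumsum_diff` matches the one used in the benchmarks below), the same steps give:

```python
import numpy as np

def cumsum_diff(A):
    # Sort rows by the key column so each group is contiguous
    sA = A[np.argsort(A[:, 0]), :]
    # True at the last row of each group
    row_mask = np.append(np.diff(sA[:, 0]) != 0, [True])
    # Cumulative sums sampled at group boundaries; diff recovers per-group sums
    cumsum_grps = sA.cumsum(0)[row_mask, 1:]
    sum_grps = np.diff(cumsum_grps, axis=0)
    counts = np.concatenate((cumsum_grps[0, :][None], sum_grps), axis=0)
    # Prepend the unique keys as the first column
    return np.concatenate((sA[row_mask, 0][:, None], counts), axis=1)

A = np.array([[1, 2, 3],
              [1, 4, 6],
              [2, 3, 5],
              [2, 6, 2],
              [7, 2, 1],
              [2, 0, 3]])
print(cumsum_diff(A))
```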

Benchmarking

Here are some runtime tests for the NumPy-based approaches presented so far -

In [319]: A = np.random.randint(0,1000,(100000,10))

In [320]: %timeit cumsum_diff(A)
100 loops, best of 3: 12.1 ms per loop

In [321]: %timeit bincount(A)
10 loops, best of 3: 21.4 ms per loop

In [322]: %timeit add_at(A)
10 loops, best of 3: 60.4 ms per loop

In [323]: A = np.random.randint(0,1000,(100000,20))

In [324]: %timeit cumsum_diff(A)
10 loops, best of 3: 32.1 ms per loop

In [325]: %timeit bincount(A)
10 loops, best of 3: 32.3 ms per loop

In [326]: %timeit add_at(A)
10 loops, best of 3: 113 ms per loop

Seems like Approach #2: cumsum + diff is performing quite well.
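The `add_at` function timed above is not listed in the answer; a plausible reconstruction based on `np.add.at` would look like this (a sketch under that assumption, not necessarily the exact code that was benchmarked):

```python
import numpy as np

def add_at(A):
    # Map each row to a dense group index via its key
    unq, inv = np.unique(A[:, 0], return_inverse=True)
    sums = np.zeros((len(unq), A.shape[1] - 1))
    # Unbuffered scatter-add: rows sharing a key accumulate into one slot
    np.add.at(sums, inv, A[:, 1:])
    # Prepend the unique keys as the first column
    return np.concatenate((unq[:, None], sums), axis=1)

A = np.array([[1, 2, 3],
              [1, 4, 6],
              [2, 3, 5],
              [2, 6, 2],
              [7, 2, 1],
              [2, 0, 3]])
print(add_at(A))
```

`np.add.at` is typically slower than `bincount` or `cumsum`-based tricks because its scatter-add loop is not vectorized internally, which is consistent with the timings above.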


Try using pandas. Group by the first column and then sum the remaining columns. Something like

df.groupby(df.iloc[:, 0]).sum()
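A minimal runnable version of that one-liner (`.ix` has since been removed from pandas, so `iloc` is used here to select the first column by position; the DataFrame has default integer column labels):

```python
import numpy as np
import pandas as pd

A = np.array([[1, 2, 3],
              [1, 4, 6],
              [2, 3, 5],
              [2, 6, 2],
              [3, 4, 8]])
df = pd.DataFrame(A)

# Group by the first column (position 0) and sum within each group
result = df.groupby(df.iloc[:, 0]).sum()
print(result)
```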