
Large Numpy Scipy CSR Matrix, row wise operation


vstack is an O(n) operation because it needs to allocate memory for the result and then copy the contents of all arrays you passed as arguments into the result array.
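To illustrate (a sketch I'm adding, not part of the original answer): if you call vstack once per new row, each call re-copies everything stacked so far, so building a matrix of n rows costs O(n^2) in total, while collecting the pieces in a list and stacking once stays O(n):

import numpy as np
from scipy.sparse import csr_matrix, vstack

pieces = [csr_matrix(np.random.rand(1, 5)) for _ in range(1000)]

# Quadratic: every iteration copies all previously stacked rows again.
acc = pieces[0]
for piece in pieces[1:]:
    acc = vstack([acc, piece])

# Linear: one allocation, each piece copied once.
result = vstack(pieces)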

You can simply use multiply to do the operation:

>>> res = counts.multiply(1 / counts.sum(1))  # multiply with inverse
>>> res.todense()
matrix([[ 0.33333333,  0.        ,  0.66666667],
        [ 0.        ,  0.        ,  1.        ],
        [ 0.26666667,  0.33333333,  0.4       ]])
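One caveat worth adding (my note, not from the original answer): rows that sum to zero would cause a division by zero, and depending on your SciPy version multiply may hand back a COO matrix rather than CSR. A defensive sketch:

import numpy as np

row_sums = np.asarray(counts.sum(axis=1)).ravel()
row_sums[row_sums == 0] = 1.0   # leave empty rows as-is instead of dividing by zero
res = counts.multiply(1.0 / row_sums[:, np.newaxis]).tocsr()   # convert back to CSR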

But it's also quite easy (and relatively performant) to use np.lib.stride_tricks.as_strided to do the operation you want. as_strided also lets you perform more complicated operations on the array when no ready-made method or function covers your case.
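For intuition about what as_strided does here (an illustration I'm adding, not from the original answer), it can expose overlapping windows of an array without copying any data:

>>> import numpy as np
>>> a = np.arange(5)
>>> np.lib.stride_tricks.as_strided(a, shape=(4, 2), strides=2*a.strides)
array([[0, 1],
       [1, 2],
       [2, 3],
       [3, 4]])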

For example, using an example CSR matrix from the SciPy documentation:

>>> from scipy.sparse import csr_matrix
>>> import numpy as np
>>> row = np.array([0, 0, 1, 2, 2, 2])
>>> col = np.array([0, 2, 2, 0, 1, 2])
>>> data = np.array([1., 2, 3, 4, 5, 6])
>>> counts = csr_matrix((data, (row, col)), shape=(3, 3))
>>> counts.todense()
matrix([[ 1.,  0.,  2.],
        [ 0.,  0.,  3.],
        [ 4.,  5.,  6.]])
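Before the trick itself, it helps to look at the internal CSR arrays (a quick aside I'm adding; the exact integer dtype can vary by platform): counts.data holds the stored values, and counts.indptr[i]:counts.indptr[i+1] delimits the slice of counts.data belonging to row i:

>>> counts.indptr
array([0, 2, 3, 6], dtype=int32)
>>> counts.indices
array([0, 2, 2, 0, 1, 2], dtype=int32)
>>> counts.data
array([ 1.,  2.,  3.,  4.,  5.,  6.])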

You can divide each row by its sum like this:

>>> row_start_stop = np.lib.stride_tricks.as_strided(counts.indptr,
...                                                  shape=(counts.shape[0], 2),
...                                                  strides=2*counts.indptr.strides)
>>> for start, stop in row_start_stop:
...     row = counts.data[start:stop]
...     row /= row.sum()
>>> counts.todense()
matrix([[ 0.33333333,  0.        ,  0.66666667],
        [ 0.        ,  0.        ,  1.        ],
        [ 0.26666667,  0.33333333,  0.4       ]])
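For completeness, the Python-level loop can be dropped entirely (a variant I'm adding, not in the original answers, assuming no row of stored entries sums to zero): np.diff(counts.indptr) gives the number of stored entries per row, so the row sums can be repeated to line up element-wise with counts.data, and empty rows are repeated zero times and thus never divided:

import numpy as np

row_sums = np.asarray(counts.sum(axis=1)).ravel()
counts.data /= np.repeat(row_sums, np.diff(counts.indptr))  # in-place row normalization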


Edit

@MSeifert's answer is much more efficient, and that should be the proper way of doing it. I didn't realize that writing counts[i, :] implies some column slicing is done, and the documentation explicitly says those are really inefficient operations on a csr_matrix. The timings below are indeed a good example of that.

Answer

The docs say row slicing is efficient, so I think you should do:

for i in range(counts.shape[0]):
    counts[i, :] /= counts[i, :].sum()

This way you edit your matrix in place, it stays sparse, and you don't have to use vstack. I'm not sure it's the most efficient operation, but at least you shouldn't run into memory issues, and there's no progressive slowdown as the rows are processed:

import time

s = time.time()
for i in range(counts.shape[0]):
    counts[i, :] /= (counts[i, :].sum() + 1)  # +1 keeps empty rows from dividing by zero
    if i % 1000 == 0:
        e = time.time()
        if i > 0:
            print(i, e - s)
        s = time.time()

Performance:

1000 6.00199794769
2000 6.02894091606
3000 7.44459486008
4000 7.10011601448
5000 6.16998195648
6000 7.79510307312
7000 7.00139117241
8000 7.37821507454
9000 7.28075814247
...

Performance for MSeifert's answer:

row_start_stop = np.lib.stride_tricks.as_strided(counts.indptr,
                                                 shape=(counts.shape[0], 2),
                                                 strides=2*counts.indptr.strides)
for i, (start, stop) in enumerate(row_start_stop):
    row = counts.data[start:stop]
    row /= row.sum()
    if i % 1000 == 0:
        e = time.time()
        if i > 0:
            print(i, e - s)
        s = time.time()

1000 0.00735783576965
2000 0.0108380317688
3000 0.0102109909058
4000 0.0131571292877
5000 0.00670218467712
6000 0.00608897209167
7000 0.00663685798645
8000 0.0164499282837
9000 0.0061981678009
...

As to why using vstack is slow, @MSeifert's answer explains it well.