
Large Numpy Scipy CSR Matrix, row wise operation


vstack is an O(n) operation because it needs to allocate memory for the result and then copy the contents of all arrays you passed as arguments into the result array.
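To illustrate (a sketch I'm adding, not part of the original answer): if you call vstack once per new row, each call re-copies everything stacked so far, so building a matrix of n rows costs O(n^2) in total, while collecting the pieces in a list and stacking once stays O(n):

import numpy as np
from scipy.sparse import csr_matrix, vstack

pieces = [csr_matrix(np.random.rand(1, 5)) for _ in range(1000)]

# Quadratic: every iteration copies all previously stacked rows again.
acc = pieces[0]
for piece in pieces[1:]:
    acc = vstack([acc, piece])

# Linear: one allocation, each piece copied once.
result = vstack(pieces)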

You can simply use multiply to do the operation:

>>> res = counts.multiply(1 / counts.sum(1))  # multiply with inverse
>>> res.todense()
matrix([[ 0.33333333,  0.        ,  0.66666667],
        [ 0.        ,  0.        ,  1.        ],
        [ 0.26666667,  0.33333333,  0.4       ]])
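One caveat worth adding (my note, not from the original answer): rows that sum to zero would cause a division by zero, and depending on your SciPy version multiply may hand back a COO matrix rather than CSR. A defensive sketch:

import numpy as np

row_sums = np.asarray(counts.sum(axis=1)).ravel()
row_sums[row_sums == 0] = 1.0   # leave empty rows as-is instead of dividing by zero
res = counts.multiply(1.0 / row_sums[:, np.newaxis]).tocsr()   # convert back to CSR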

But it's also quite easy (and relatively performant) to use np.lib.stride_tricks.as_strided to do the operation you want. as_strided also lets you perform more complicated operations on the array when no ready-made method or function covers your case.
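For intuition about what as_strided does here (an illustration I'm adding, not from the original answer), it can expose overlapping windows of an array without copying any data:

>>> import numpy as np
>>> a = np.arange(5)
>>> np.lib.stride_tricks.as_strided(a, shape=(4, 2), strides=2*a.strides)
array([[0, 1],
       [1, 2],
       [2, 3],
       [3, 4]])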

For example, using an example CSR matrix from the SciPy documentation:

>>> from scipy.sparse import csr_matrix
>>> import numpy as np
>>> row = np.array([0, 0, 1, 2, 2, 2])
>>> col = np.array([0, 2, 2, 0, 1, 2])
>>> data = np.array([1., 2, 3, 4, 5, 6])
>>> counts = csr_matrix((data, (row, col)), shape=(3, 3))
>>> counts.todense()
matrix([[ 1.,  0.,  2.],
        [ 0.,  0.,  3.],
        [ 4.,  5.,  6.]])
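Before the trick itself, it helps to look at the internal CSR arrays (a quick aside I'm adding; the exact integer dtype can vary by platform): counts.data holds the stored values, and counts.indptr[i]:counts.indptr[i+1] delimits the slice of counts.data belonging to row i:

>>> counts.indptr
array([0, 2, 3, 6], dtype=int32)
>>> counts.indices
array([0, 2, 2, 0, 1, 2], dtype=int32)
>>> counts.data
array([ 1.,  2.,  3.,  4.,  5.,  6.])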

You can divide each row by its sum like this:

>>> row_start_stop = np.lib.stride_tricks.as_strided(counts.indptr,
...                                                  shape=(counts.shape[0], 2),
...                                                  strides=2*counts.indptr.strides)
>>> for start, stop in row_start_stop:
...     row = counts.data[start:stop]
...     row /= row.sum()
>>> counts.todense()
matrix([[ 0.33333333,  0.        ,  0.66666667],
        [ 0.        ,  0.        ,  1.        ],
        [ 0.26666667,  0.33333333,  0.4       ]])
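For completeness, the Python-level loop can be dropped entirely (a variant I'm adding, not in the original answers, assuming no row of stored entries sums to zero): np.diff(counts.indptr) gives the number of stored entries per row, so the row sums can be repeated to line up element-wise with counts.data, and empty rows are repeated zero times and thus never divided:

import numpy as np

row_sums = np.asarray(counts.sum(axis=1)).ravel()
counts.data /= np.repeat(row_sums, np.diff(counts.indptr))  # in-place row normalization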


Edit

@MSeifert's answer is much more efficient, and that should be the proper way of doing it. I didn't realize that writing counts[i, :] implies some column slicing is done, and the documentation explicitly says those are really inefficient operations on a csr_matrix. The timings below are indeed a good example of that.

Answer

The docs say row slicing is efficient, so I think you should do:

for i in range(counts.shape[0]):
    counts[i, :] /= counts[i, :].sum()

This way you edit your matrix in place, it stays sparse, and you don't have to use vstack. I'm not sure it's the most efficient operation, but at least you shouldn't run into memory issues, and there's no progressive slowdown as the rows are processed:

import time

s = time.time()
for i in range(counts.shape[0]):
    counts[i, :] /= (counts[i, :].sum() + 1)  # +1 keeps empty rows from dividing by zero
    if i % 1000 == 0:
        e = time.time()
        if i > 0:
            print(i, e - s)
        s = time.time()

Performance:

1000 6.00199794769
2000 6.02894091606
3000 7.44459486008
4000 7.10011601448
5000 6.16998195648
6000 7.79510307312
7000 7.00139117241
8000 7.37821507454
9000 7.28075814247
...

Performance for MSeifert's answer:

row_start_stop = np.lib.stride_tricks.as_strided(counts.indptr,
                                                 shape=(counts.shape[0], 2),
                                                 strides=2*counts.indptr.strides)
for i, (start, stop) in enumerate(row_start_stop):
    row = counts.data[start:stop]
    row /= row.sum()
    if i % 1000 == 0:
        e = time.time()
        if i > 0:
            print(i, e - s)
        s = time.time()

1000 0.00735783576965
2000 0.0108380317688
3000 0.0102109909058
4000 0.0131571292877
5000 0.00670218467712
6000 0.00608897209167
7000 0.00663685798645
8000 0.0164499282837
9000 0.0061981678009
...

As to why using vstack is slow, @MSeifert's answer explains it well.