Elementwise maximum of sparse Scipy matrix & vector with broadcasting
A low level approach
As always you can think on how a proper sparse matrix format for this operation is built up, for csr-matrices the main components are shape, data_arr,indices and ind_ptr.With these parts of the scipy.sparse.csr object it is quite straight forward but maybe a bit time consuming to implement an efficient algorithm in a compiled language (C,C++,Cython, Python-Numba). Int his implemenation I used Numba, but porting it to C++ should be easily possible (syntax changes) and maybe avoiding the slicing.
Implementation (first try)
import numpy as npimport numba as nb# get all needed components of the csr object and create a resulting csr object at the enddef sparse_elementwise_maximum_wrap(mat,vec): mat_csr=mat.tocsr() vec_csr=vec.tocsr() shape_mat=mat_csr.shape indices_mat=mat_csr.indices indptr_mat=mat_csr.indptr data_mat=mat_csr.data indices_vec=vec_csr.indices data_vec=vec_csr.data res=sparse_elementwise_maximum_nb(indices_mat,indptr_mat,data_mat,shape_mat,indices_vec,data_vec) res=sparse.csr_matrix(res, shape=shape_mat) return res@nb.njit(cache=True)def sparse_elementwise_maximum_nb(indices_mat,indptr_mat,data_mat,shape_mat,vec_row_ind,vec_row_data): data_res=[] indices_res=[] indptr_mat_res=[] indptr_mat_=0 indptr_mat_res.append(indptr_mat_) for row_idx in range(shape_mat[0]): mat_row_ind=indices_mat[indptr_mat[row_idx]:indptr_mat[row_idx+1]] mat_row_data=data_mat[indptr_mat[row_idx]:indptr_mat[row_idx+1]] mat_ptr=0 vec_ptr=0 while mat_ptr<mat_row_ind.shape[0] and vec_ptr<vec_row_ind.shape[0]: ind_mat=mat_row_ind[mat_ptr] ind_vec=vec_row_ind[vec_ptr] #value for both matrix and vector is present if ind_mat==ind_vec: data_res.append(max(mat_row_data[mat_ptr],vec_row_data[vec_ptr])) indices_res.append(ind_mat) mat_ptr+=1 vec_ptr+=1 indptr_mat_+=1 #only value for the matrix is present vector is assumed 0 elif ind_mat<ind_vec: if mat_row_data[mat_ptr] >0: data_res.append(mat_row_data[mat_ptr]) indices_res.append(ind_mat) indptr_mat_+=1 mat_ptr+=1 #only value for the vector is present matrix is assumed 0 else: if vec_row_data[vec_ptr] >0: data_res.append(vec_row_data[vec_ptr]) indices_res.append(ind_vec) indptr_mat_+=1 vec_ptr+=1 for i in range(mat_ptr,mat_row_ind.shape[0]): if mat_row_data[i] >0: data_res.append(mat_row_data[i]) indices_res.append(mat_row_ind[i]) indptr_mat_+=1 for i in range(vec_ptr,vec_row_ind.shape[0]): if vec_row_data[i] >0: data_res.append(vec_row_data[i]) indices_res.append(vec_row_ind[i]) indptr_mat_+=1 indptr_mat_res.append(indptr_mat_) return np.array(data_res),np.array(indices_res),np.array(indptr_mat_res)
Implementation (optimized)
In this approach the lists are replaced by a dynamically resized array. I increased the size of the output in 60 MB steps. On creation of the csr-object, there is also no copy of the data made, just references. If you want avoid a memory overhead you have to copy the arrays in the end.
@nb.njit(cache=True)def sparse_elementwise_maximum_nb(indices_mat,indptr_mat,data_mat,shape_mat,vec_row_ind,vec_row_data): mem_step=5_000_000 #preallocate memory for 5M non-zero elements (60 MB in this example) data_res=np.empty(mem_step,dtype=data_mat.dtype) indices_res=np.empty(mem_step,dtype=np.int32) data_res_p=0 indptr_mat_res=np.empty((shape_mat[0]+1),dtype=np.int32) indptr_mat_res[0]=0 indptr_mat_res_p=1 indptr_mat_=0 for row_idx in range(shape_mat[0]): mat_row_ind=indices_mat[indptr_mat[row_idx]:indptr_mat[row_idx+1]] mat_row_data=data_mat[indptr_mat[row_idx]:indptr_mat[row_idx+1]] #check if resizing is necessary if data_res.shape[0]<data_res_p+shape_mat[1]: #add at least memory for another mem_step elements size_to_add=mem_step if shape_mat[1] >size_to_add: size_to_add=shape_mat[1] data_res_2 =np.empty(data_res.shape[0] +size_to_add,data_res.dtype) indices_res_2=np.empty(indices_res.shape[0]+size_to_add,indices_res.dtype) for i in range(data_res_p): data_res_2[i]=data_res[i] indices_res_2[i]=indices_res[i] data_res=data_res_2 indices_res=indices_res_2 mat_ptr=0 vec_ptr=0 while mat_ptr<mat_row_ind.shape[0] and vec_ptr<vec_row_ind.shape[0]: ind_mat=mat_row_ind[mat_ptr] ind_vec=vec_row_ind[vec_ptr] #value for both matrix and vector is present if ind_mat==ind_vec: data_res[data_res_p]=max(mat_row_data[mat_ptr],vec_row_data[vec_ptr]) indices_res[data_res_p]=ind_mat data_res_p+=1 mat_ptr+=1 vec_ptr+=1 indptr_mat_+=1 #only value for the matrix is present vector is assumed 0 elif ind_mat<ind_vec: if mat_row_data[mat_ptr] >0: data_res[data_res_p]=mat_row_data[mat_ptr] indices_res[data_res_p]=ind_mat data_res_p+=1 indptr_mat_+=1 mat_ptr+=1 #only value for the vector is present matrix is assumed 0 else: if vec_row_data[vec_ptr] >0: data_res[data_res_p]=vec_row_data[vec_ptr] indices_res[data_res_p]=ind_vec data_res_p+=1 indptr_mat_+=1 vec_ptr+=1 for i in range(mat_ptr,mat_row_ind.shape[0]): if mat_row_data[i] >0: data_res[data_res_p]=mat_row_data[i] indices_res[data_res_p]=mat_row_ind[i] data_res_p+=1 indptr_mat_+=1 for i in range(vec_ptr,vec_row_ind.shape[0]): if vec_row_data[i] >0: data_res[data_res_p]=vec_row_data[i] indices_res[data_res_p]=vec_row_ind[i] data_res_p+=1 indptr_mat_+=1 indptr_mat_res[indptr_mat_res_p]=indptr_mat_ indptr_mat_res_p+=1 return data_res[:data_res_p],indices_res[:data_res_p],indptr_mat_res
Maximum memory allocated in the beginning
The performance and usability of this approach heavily depends on the inputs. In this approach the maximal memory is allocated (this could easily cause out of memory errors).
@nb.njit(cache=True)def sparse_elementwise_maximum_nb(indices_mat,indptr_mat,data_mat,shape_mat,vec_row_ind,vec_row_data,shrink_to_fit): max_non_zero=shape_mat[0]*vec_row_data.shape[0]+data_mat.shape[0] data_res=np.empty(max_non_zero,dtype=data_mat.dtype) indices_res=np.empty(max_non_zero,dtype=np.int32) data_res_p=0 indptr_mat_res=np.empty((shape_mat[0]+1),dtype=np.int32) indptr_mat_res[0]=0 indptr_mat_res_p=1 indptr_mat_=0 for row_idx in range(shape_mat[0]): mat_row_ind=indices_mat[indptr_mat[row_idx]:indptr_mat[row_idx+1]] mat_row_data=data_mat[indptr_mat[row_idx]:indptr_mat[row_idx+1]] mat_ptr=0 vec_ptr=0 while mat_ptr<mat_row_ind.shape[0] and vec_ptr<vec_row_ind.shape[0]: ind_mat=mat_row_ind[mat_ptr] ind_vec=vec_row_ind[vec_ptr] #value for both matrix and vector is present if ind_mat==ind_vec: data_res[data_res_p]=max(mat_row_data[mat_ptr],vec_row_data[vec_ptr]) indices_res[data_res_p]=ind_mat data_res_p+=1 mat_ptr+=1 vec_ptr+=1 indptr_mat_+=1 #only value for the matrix is present vector is assumed 0 elif ind_mat<ind_vec: if mat_row_data[mat_ptr] >0: data_res[data_res_p]=mat_row_data[mat_ptr] indices_res[data_res_p]=ind_mat data_res_p+=1 indptr_mat_+=1 mat_ptr+=1 #only value for the vector is present matrix is assumed 0 else: if vec_row_data[vec_ptr] >0: data_res[data_res_p]=vec_row_data[vec_ptr] indices_res[data_res_p]=ind_vec data_res_p+=1 indptr_mat_+=1 vec_ptr+=1 for i in range(mat_ptr,mat_row_ind.shape[0]): if mat_row_data[i] >0: data_res[data_res_p]=mat_row_data[i] indices_res[data_res_p]=mat_row_ind[i] data_res_p+=1 indptr_mat_+=1 for i in range(vec_ptr,vec_row_ind.shape[0]): if vec_row_data[i] >0: data_res[data_res_p]=vec_row_data[i] indices_res[data_res_p]=vec_row_ind[i] data_res_p+=1 indptr_mat_+=1 indptr_mat_res[indptr_mat_res_p]=indptr_mat_ indptr_mat_res_p+=1 if shrink_to_fit==True: data_res=np.copy(data_res[:data_res_p]) indices_res=np.copy(indices_res[:data_res_p]) else: data_res=data_res[:data_res_p] indices_res=indices_res[:data_res_p] return data_res,indices_res,indptr_mat_res# get all needed components of the csr object and create a resulting csr object at the enddef sparse_elementwise_maximum_wrap(mat,vec,shrink_to_fit=True): mat_csr=mat.tocsr() vec_csr=vec.tocsr() shape_mat=mat_csr.shape indices_mat=mat_csr.indices indptr_mat=mat_csr.indptr data_mat=mat_csr.data indices_vec=vec_csr.indices data_vec=vec_csr.data res=sparse_elementwise_maximum_nb(indices_mat,indptr_mat,data_mat,shape_mat,indices_vec,data_vec,shrink_to_fit) res=sparse.csr_matrix(res, shape=shape_mat) return res
Timings
Numba has a compilation overhead or some overhead to load the function from cache. Don't consider the first call if you want to get the runtime and not compilation+runtime.
import numpy as npfrom scipy import sparsemat = sparse.csr_matrix( sparse.random(20000, 4000, density=.01, data_rvs=lambda s: np.random.randint(0, 5000, size=s)) )vec = sparse.csr_matrix( sparse.random(1, 4000, density=.01, data_rvs=lambda s: np.random.randint(0, 5000, size=s)) )%timeit output=sparse_elementwise_maximum(mat, vec)#for csc input37.9 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)#for csr input10.7 s ± 90.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)#Daniel F%timeit sparse_maximum(mat, vec)164 ms ± 1.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)#low level implementation (first try)%timeit res=sparse_elementwise_maximum_wrap(mat,vec)89.7 ms ± 2.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)#low level implementation (optimized, csr)%timeit res=sparse_elementwise_maximum_wrap(mat,vec)16.5 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)#low level implementation (preallocation, without copying at the end)%timeit res=sparse_elementwise_maximum_wrap(mat,vec)16.5 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)#low level implementation (preallocation, with copying at the end)%timeit res=sparse_elementwise_maximum_wrap(mat,vec)16.5 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)%timeit res=sparse_elementwise_maximum_wrap(mat,vec,shrink_to_fit=False)14.9 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)%timeit res=sparse_elementwise_maximum_wrap(mat,vec,shrink_to_fit=True)21.7 ms ± 399 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)#For comparison, copying the result takes%%timeitnp.copy(res.data)np.copy(res.indices)np.copy(res.indptr)7.8 ms ± 47.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
scipy.sparse
matrices don't broadcast. At all. So unless you can figure out some way to operate on the indices
and inpts
(I haven't), you're stuck stacking. Best I can figure out is just to vstack
your vec
s until they're the same shape as mat
. It seems to give a good speedup, although it doesn't explain the segfault weirdness with csr
.
#using `mat` and `vec` from the speed testdef sparse_maximum(mat, vec): vec1 = sparse.vstack([vec for _ in range(mat.shape[0])]) return mat.maximum(vec1)# Time itnum_timing_loops = 3.0starttime = timeit.default_timer()sparse_maximum(mat, vec)print('time per call is:', (timeit.default_timer() - starttime)/num_timing_loops, 'seconds')# I was getting 11-12 seconds on your original codetime per call is: 0.514533479333295 seconds
Proof that it works on original matrices:
vec = sparse.vstack([vec for _ in range(4)])print(mat.maximum(vec).todense())[[ 0 5 100] [ 3 5 100] [ 6 7 100] [ 9 10 100]]
Looking at the maximum
method, and code, especially the _binopt
method in /scipy/sparse/compressed.py
it's apparent that it can work with a scalar other
. For a sparse other
it constructs a new sparse matrix (of the same format and shape) using indptr
, etc values. If other has the same shape, it works:
In [55]: mat = sparse.csr_matrix(np.arange(12).reshape((4,3)))In [64]: mat.maximum(mat)Out[64]: <4x3 sparse matrix of type '<class 'numpy.int64'>' with 11 stored elements in Compressed Sparse Row format>
It fails is the other is a 1d sparse matrix:
In [65]: mat.maximum(mat[0,:])Segmentation fault (core dumped)
mat.maximum(mat[:,0])
runs without error, though I'm not sure about the values. mat[:,0]
will have the same size indptr
.
I thought the mat.maximum(mat[:,0])
would give same fault if mat
was csc
, but it doesn't.
Let's be honest, this kind of operation is not a strong point for sparse matrices. The core of its math is matrix multiplication. That's what they were originally developed for - sparse linear algebra problems such as finite difference and finite element.