
Save / load scipy sparse csr_matrix in portable data format


edit: scipy 0.19 now has scipy.sparse.save_npz and scipy.sparse.load_npz.

    from scipy import sparse

    sparse.save_npz("yourmatrix.npz", your_matrix)
    your_matrix_back = sparse.load_npz("yourmatrix.npz")

For both functions, the file argument may also be a file-like object (e.g. the result of open) instead of a filename.
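For instance, a minimal sketch passing open file objects (reusing the variable names from above):

    from scipy import sparse

    # save_npz / load_npz also accept an already-open binary file object
    with open("yourmatrix.npz", "wb") as f:
        sparse.save_npz(f, your_matrix)

    with open("yourmatrix.npz", "rb") as f:
        your_matrix_back = sparse.load_npz(f)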


Got an answer from the Scipy user group:

A csr_matrix has 3 data attributes that matter: .data, .indices, and .indptr. All are simple ndarrays, so numpy.save will work on them. Save the three arrays with numpy.save or numpy.savez, load them back with numpy.load, and then recreate the sparse matrix object with:

    new_csr = csr_matrix((data, indices, indptr), shape=(M, N))

So for example:

    def save_sparse_csr(filename, array):
        np.savez(filename, data=array.data, indices=array.indices,
                 indptr=array.indptr, shape=array.shape)

    def load_sparse_csr(filename):
        loader = np.load(filename)
        return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                          shape=loader['shape'])
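For instance, a quick roundtrip with a small matrix (the matrix and file name here are purely illustrative) shows the three underlying arrays and confirms the reconstruction:

    import numpy as np
    from scipy.sparse import csr_matrix

    m = csr_matrix(np.array([[0, 0, 3], [4, 0, 0], [0, 5, 6]]))
    m.data      # the non-zero values:   [3, 4, 5, 6]
    m.indices   # their column indices:  [2, 0, 1, 2]
    m.indptr    # row start offsets:     [0, 1, 2, 4]

    save_sparse_csr("m_csr", m)                # np.savez appends the .npz extension
    m_back = load_sparse_csr("m_csr.npz")
    (m != m_back).nnz == 0                     # True -> the matrices are identical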


Although you write that scipy.io.mmwrite and scipy.io.mmread don't work for you, I just want to add how they work. This question is the no. 1 Google hit, so I myself started with np.savez and pickle.dump before switching to the simple and obvious scipy functions. They work for me and shouldn't be overlooked by those who haven't tried them yet.

    from scipy import sparse, io

    m = sparse.csr_matrix([[0,0,0],[1,0,0],[0,1,0]])
    m              # <3x3 sparse matrix of type '<type 'numpy.int64'>' with 2 stored elements in Compressed Sparse Row format>

    io.mmwrite("test.mtx", m)
    del m

    newm = io.mmread("test.mtx")
    newm           # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in COOrdinate format>
    newm.tocsr()   # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in Compressed Sparse Row format>
    newm.toarray() # array([[0, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=int32)


Here is a performance comparison of the three most upvoted answers, run in a Jupyter notebook. The input is a 1M x 100K random sparse matrix with density 0.001, containing 100M non-zero values:

    from scipy.sparse import random

    matrix = random(1000000, 100000, density=0.001, format='csr')
    matrix
    <1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
        with 100000000 stored elements in Compressed Sparse Row format>

io.mmwrite / io.mmread

    from scipy import io   # mmwrite/mmread live in scipy.io, not scipy.sparse

    %time io.mmwrite('test_io.mtx', matrix)
    CPU times: user 4min 37s, sys: 2.37 s, total: 4min 39s
    Wall time: 4min 39s

    %time matrix = io.mmread('test_io.mtx')
    CPU times: user 2min 41s, sys: 1.63 s, total: 2min 43s
    Wall time: 2min 43s

    matrix
    <1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
        with 100000000 stored elements in COOrdinate format>

    Filesize: 3.0G.

(note that the format has been changed from csr to coo).
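If you need the CSR format back after reading, you can convert explicitly; a minimal sketch reusing the file written above:

    from scipy import io

    # io.mmread returns a COO matrix; convert it back to CSR explicitly
    matrix = io.mmread('test_io.mtx').tocsr()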

np.savez / np.load

    import numpy as np
    from scipy.sparse import csr_matrix

    def save_sparse_csr(filename, array):
        # note that .npz extension is added automatically
        np.savez(filename, data=array.data, indices=array.indices,
                 indptr=array.indptr, shape=array.shape)

    def load_sparse_csr(filename):
        # here we need to add .npz extension manually
        loader = np.load(filename + '.npz')
        return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                          shape=loader['shape'])

    %time save_sparse_csr('test_savez', matrix)
    CPU times: user 1.26 s, sys: 1.48 s, total: 2.74 s
    Wall time: 2.74 s

    %time matrix = load_sparse_csr('test_savez')
    CPU times: user 1.18 s, sys: 548 ms, total: 1.73 s
    Wall time: 1.73 s

    matrix
    <1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
        with 100000000 stored elements in Compressed Sparse Row format>

    Filesize: 1.1G.

cPickle

    import cPickle as pickle

    def save_pickle(matrix, filename):
        with open(filename, 'wb') as outfile:
            pickle.dump(matrix, outfile, pickle.HIGHEST_PROTOCOL)

    def load_pickle(filename):
        with open(filename, 'rb') as infile:
            matrix = pickle.load(infile)
        return matrix

    %time save_pickle(matrix, 'test_pickle.mtx')
    CPU times: user 260 ms, sys: 888 ms, total: 1.15 s
    Wall time: 1.15 s

    %time matrix = load_pickle('test_pickle.mtx')
    CPU times: user 376 ms, sys: 988 ms, total: 1.36 s
    Wall time: 1.37 s

    matrix
    <1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
        with 100000000 stored elements in Compressed Sparse Row format>

    Filesize: 1.1G.

Note: cPickle does not work with very large objects (see this answer). In my experience, it didn't work for a 2.7M x 50k matrix with 270M non-zero values. The np.savez solution worked well.

Conclusion

(Based on this simple test for CSR matrices.) cPickle is the fastest method, but it doesn't work with very large matrices; np.savez is only slightly slower, while io.mmwrite is much slower, produces a bigger file, and restores the matrix to the wrong format. So np.savez is the winner here.