Out-of-core processing of sparse CSR arrays


So I don't know anything about joblib or dask, let alone your application-specific data format. But it is actually possible to read sparse matrices from disk in chunks while retaining the sparse data structure.

While the Wikipedia article for the CSR format does a great job explaining how it works, I'll give a short recap:

Some sparse matrix, e.g.:

1 0 2
0 0 3
4 5 6

is stored by remembering each non-zero value and the column it resides in:

sparse.data    = 1 2 3 4 5 6  # actual values
sparse.indices = 0 2 2 0 1 2  # column index (0-indexed)

Now we are still missing the rows. The compressed format just stores how many non-zero values there are in each row, instead of storing every single value's row.

Note that the non-zero counts are accumulated, so the following array contains, for each row, the number of non-zero values up to and including that row. To complicate things even further, the array always starts with a 0 and thus contains num_rows+1 entries:

sparse.indptr = 0 2 3 6

So up to and including the second row there are 3 non-zero values, namely 1, 2 and 3.
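
If you want to verify this layout yourself, a quick sanity check along these lines should do the trick (just building the example matrix with scipy and printing the three arrays; this is only for illustration and not part of the chunked reader below):

import numpy as np
import scipy.sparse

# the example matrix from above
dense = np.array([[1, 0, 2],
                  [0, 0, 3],
                  [4, 5, 6]], dtype=np.float32)
sparse = scipy.sparse.csr_matrix(dense)

print(sparse.data)     # [1. 2. 3. 4. 5. 6.]
print(sparse.indices)  # [0 2 2 0 1 2]
print(sparse.indptr)   # [0 2 3 6]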

Now that we have this sorted out, we can start 'slicing' the matrix. The goal is to construct the data, indices and indptr arrays for each chunk. Assume the original huge matrix is stored in three binary files, which we read incrementally. We use a generator to repeatedly yield chunks.

For this we need to know how many non-zero values are in each chunk, and read the corresponding number of values and column indices. The non-zero count can be conveniently read from the indptr array: we read as many entries from the huge indptr file as correspond to the desired chunk size, and the last entry of that portion minus the number of non-zeros before the chunk gives the number of non-zeros in the chunk. The chunk's data and indices arrays are then just sliced from the big data and indices files. The chunk's indptr array needs to be prepended artificially with a zero (that's what the format wants, don't ask me :D).

Then we can just pass the chunk's data, indices and indptr arrays to the constructor to get a new sparse matrix.
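
To make that concrete, here is a minimal in-memory sketch of slicing the first two rows of the example matrix above (just an illustration of the arithmetic; the actual generator below reads the arrays from files instead):

import numpy as np
import scipy.sparse

data    = np.array([1, 2, 3, 4, 5, 6], dtype=np.float32)
indices = np.array([0, 2, 2, 0, 1, 2], dtype=np.int32)
indptr  = np.array([0, 2, 3, 6], dtype=np.int32)

start, stop = 0, 2           # chunk = rows 0 and 1
nnz_before = indptr[start]   # 0 non-zeros before the chunk
nnz_after = indptr[stop]     # 3 non-zeros up to and including row 1

chunk_data = data[nnz_before:nnz_after]        # [1. 2. 3.]
chunk_indices = indices[nnz_before:nnz_after]  # [0 2 2]
chunk_indptr = indptr[start:stop + 1] - nnz_before  # [0 2 3], leading 0 included

chunk = scipy.sparse.csr_matrix((chunk_data, chunk_indices, chunk_indptr),
                                shape=(2, 3))
print(chunk.toarray())   # [[1. 0. 2.]
                         #  [0. 0. 3.]]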

It has to be noted that the actual matrix width cannot be reconstructed from the three arrays alone: at best it can be guessed from the maximum column index occurring in the chunk, and if you are unlucky and there is no data in the chunk at all it is completely undetermined. So we also need to pass the column count explicitly.
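
A tiny illustration of that pitfall (with a hypothetical 4-column matrix, my numbers): if the chunk happens to contain no entry in the last column, letting scipy guess the shape silently yields a matrix that is too narrow:

import numpy as np
import scipy.sparse

chunk_data = np.array([1, 2, 3], dtype=np.float32)
chunk_indices = np.array([0, 2, 2], dtype=np.int32)
chunk_indptr = np.array([0, 2, 3], dtype=np.int32)

guessed = scipy.sparse.csr_matrix((chunk_data, chunk_indices, chunk_indptr))
explicit = scipy.sparse.csr_matrix((chunk_data, chunk_indices, chunk_indptr),
                                   shape=(2, 4))
print(guessed.shape)   # (2, 3) -- inferred from the largest column index
print(explicit.shape)  # (2, 4) -- the real column count has to be passed in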

I probably explained things in a rather complicated way, so just read the following as an opaque piece of code that implements such a generator:

import numpy as np
import scipy.sparse


def gen_batches(batch_size, sparse_data_path, sparse_indices_path,
                sparse_indptr_path, dtype=np.float32, column_size=None):
    data_item_size = dtype().itemsize

    with open(sparse_data_path, 'rb') as data_file, \
            open(sparse_indices_path, 'rb') as indices_file, \
            open(sparse_indptr_path, 'rb') as indptr_file:
        # cumulative non-zero count up to the current chunk
        nnz_before = np.frombuffer(indptr_file.read(4), dtype=np.int32)

        while True:
            indptr_batch = np.frombuffer(nnz_before.tobytes() +
                              indptr_file.read(4*batch_size), dtype=np.int32)

            if len(indptr_batch) == 1:
                break

            batch_indptr = indptr_batch - nnz_before
            nnz_before = indptr_batch[-1]
            batch_nnz = int(batch_indptr[-1])

            batch_data = np.frombuffer(data_file.read(
                                       data_item_size * batch_nnz), dtype=dtype)
            batch_indices = np.frombuffer(indices_file.read(
                                          4 * batch_nnz), dtype=np.int32)

            dimensions = (len(indptr_batch)-1, column_size)

            matrix = scipy.sparse.csr_matrix((batch_data,
                            batch_indices, batch_indptr), shape=dimensions)
            yield matrix


if __name__ == '__main__':
    sparse = scipy.sparse.random(5, 4, density=0.1, format='csr', dtype=np.float32)
    sparse.data.tofile('sparse.data')        # dtype as specified above  ^^^^^^^^^^
    sparse.indices.tofile('sparse.indices')  # dtype=int32
    sparse.indptr.tofile('sparse.indptr')    # dtype=int32

    print(sparse.toarray())
    print('========')

    for batch in gen_batches(2, 'sparse.data', 'sparse.indices',
                              'sparse.indptr', column_size=4):
        print(batch.toarray())

numpy.ndarray.tofile() just dumps the raw binary array, so you need to remember the data format. scipy.sparse represents the indices and indptr as int32, so that is a limitation for the total matrix size.
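
A hedged sketch of what that means in practice (just an assumption about how you might dump your data): casting explicitly before tofile() guarantees that the on-disk layout matches the float32/int32 reads hard-coded in the generator above:

import numpy as np
import scipy.sparse

sparse = scipy.sparse.random(5, 4, density=0.1, format='csr', dtype=np.float32)

# make the on-disk dtypes explicit so the reader side knows what to expect
sparse.data.astype(np.float32).tofile('sparse.data')
sparse.indices.astype(np.int32).tofile('sparse.indices')
sparse.indptr.astype(np.int32).tofile('sparse.indptr')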

Also, I benchmarked the code and found that the scipy CSR matrix constructor is the bottleneck for small matrices. Your mileage may vary though, this is just a 'proof of principle'.

If there is a need for a more sophisticated implementation, or something is too obtuse, just hit me up :)