scipy csr_matrix: understand indptr
Maybe this explanation can help understand the concept:
- data is an array containing all the non-zero elements of the sparse matrix.
- indices is an array mapping each element in data to its column in the sparse matrix.
- indptr then maps the elements of data and indices to the rows of the sparse matrix. This is done with the following reasoning:
  - If the sparse matrix has M rows, indptr is an array containing M+1 elements.
  - For row i, [indptr[i]:indptr[i+1]] returns the indices of elements to take from data and indices corresponding to row i. So suppose indptr[i]=k and indptr[i+1]=l, the data corresponding to row i would be data[k:l] at columns indices[k:l]. This is the tricky part, and I hope the following example helps understanding it.
EDIT: I replaced the numbers in data by letters to avoid confusion in the following example.
Note: the values in indptr are necessarily non-decreasing, because the next cell in indptr (the next row) refers to the next values in data and indices corresponding to that row. Two consecutive values are equal exactly when the row between them stores no elements.
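As a minimal sketch of the slicing rule described above, the snippet below uses plain Python lists with letters as data. The arrays and the helper name row_entries are hypothetical, chosen only for illustration; they are not taken from the original answer.

```python
# Hypothetical letter-valued CSR arrays for a 3x3 matrix (illustration only).
data = ["a", "b", "c", "d", "e", "f"]
indices = [0, 2, 2, 0, 1, 2]   # column of each entry in data
indptr = [0, 2, 3, 6]          # M + 1 = 4 entries for M = 3 rows


def row_entries(i):
    """Return the (column, value) pairs stored for row i."""
    k, l = indptr[i], indptr[i + 1]
    return list(zip(indices[k:l], data[k:l]))


for i in range(len(indptr) - 1):
    print(i, row_entries(i))
# 0 [(0, 'a'), (2, 'b')]
# 1 [(2, 'c')]
# 2 [(0, 'd'), (1, 'e'), (2, 'f')]
```

So row 0 holds a at column 0 and b at column 2, row 1 holds c at column 2, and row 2 holds d, e, f at columns 0, 1, 2.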
Sure, the elements inside indptr are in ascending order. But how to explain the indptr behavior? In short: whenever an element of indptr is the same as the previous one (it does not increase), the corresponding row of the sparse matrix stores no elements, so you can skip that row index.
The following example illustrates the above interpretation of indptr elements:
Example 1) imagine this matrix:
    from scipy.sparse import csr_matrix

    # Target matrix:
    # array([[0, 1, 0], [8, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 7]])
    mat1 = csr_matrix(([1, 8, 7], [1, 0, 2], [0, 1, 2, 2, 2, 3]), shape=(5, 3))
    mat1.indptr     # array([0, 1, 2, 2, 2, 3], dtype=int32)
    mat1.todense()  # to get the corresponding dense matrix
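To make the "skip the row" reading concrete, here is a small sketch that takes the indptr of mat1 from Example 1 and uses np.diff to count the stored elements per row. The variable names nnz_per_row and empty_rows are my own, used for illustration.

```python
import numpy as np

indptr = np.array([0, 1, 2, 2, 2, 3])  # indptr of mat1 from Example 1

# indptr[i + 1] - indptr[i] is the number of stored elements in row i.
nnz_per_row = np.diff(indptr)

# Rows where indptr does not increase store nothing.
empty_rows = np.flatnonzero(nnz_per_row == 0)

print(nnz_per_row)  # [1 1 0 0 1]
print(empty_rows)   # [2 3]
```

Rows 2 and 3 are exactly the all-zero rows of the matrix in Example 1.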
Example 2) Array to csr_matrix (the case when the dense array already exists):
    import numpy as np
    from scipy.sparse import csr_matrix

    arr = np.array([[0, 0, 0], [8, 0, 0], [0, 5, 4], [0, 0, 0], [0, 0, 7]])
    mat2 = csr_matrix(arr)
    mat2.indptr   # array([0, 0, 1, 3, 3, 4], dtype=int32)
    mat2.indices  # array([0, 1, 2, 2], dtype=int32)
    mat2.data     # array([8, 5, 4, 7]) (dtype follows arr, so it is platform-dependent)
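For intuition, the three arrays of Example 2 can also be derived by hand from the dense array with NumPy alone. This is only a sketch of the idea, not a description of how scipy builds them internally; it relies on np.nonzero returning positions in row-major order.

```python
import numpy as np

arr = np.array([[0, 0, 0], [8, 0, 0], [0, 5, 4], [0, 0, 0], [0, 0, 7]])

# Positions of the non-zero entries, traversed row by row.
rows, cols = np.nonzero(arr)

data = arr[rows, cols]  # the non-zero values
indices = cols          # the column of each value

# Count the non-zeros per row, then take a running total starting at 0.
counts = np.bincount(rows, minlength=arr.shape[0])
indptr = np.concatenate(([0], np.cumsum(counts)))

print(data)     # [8 5 4 7]
print(indices)  # [0 1 2 2]
print(indptr)   # [0 0 1 3 3 4]
```

The running total is what makes indptr non-decreasing: a row with zero stored elements adds nothing, so its entry repeats the previous value.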
    import numpy as np
    from scipy.sparse import csr_matrix

    indptr = np.array([0, 2, 3, 6])
    indices = np.array([0, 2, 2, 0, 1, 2])
    data = np.array([1, 2, 3, 4, 5, 6])
    csr_matrix((data, indices, indptr), shape=(3, 3)).toarray()
    # array([[1, 0, 2],
    #        [0, 0, 3],
    #        [4, 5, 6]])
The above example is from the scipy documentation.
The data array contains the non-zero elements present in the sparse matrix traversed row-wise.
The indices array gives the column number for each non-zero data point.
For example: column 0 for the first element in data (i.e. 1), column 2 for the second element in data (i.e. 2), and so on until the last data element, so the data array and the indices array have the same size.
The indptr array indicates, for each row, the position in data of the first element of that row. Its size is one more than the number of rows, and its last entry equals the total number of stored elements.
For example: the first element of indptr is 0, indicating that the first element of row 0 is at data[0], i.e. '1'; the second element of indptr is 2, indicating that the first element of row 1 is at data[2], i.e. '3'; and the third element of indptr is 3, indicating that the first element of row 2 is at data[3], i.e. '4'. The last element, 6, marks where row 2 ends.
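The same reading can be checked by rebuilding the dense matrix by hand from the three arrays. This is a rough sketch that assumes no column index is repeated within a row (scipy would sum such duplicates, while this loop would overwrite them); the names dense, start and stop are my own.

```python
import numpy as np

indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])

n_rows, n_cols = len(indptr) - 1, 3  # shape (3, 3), as in the example
dense = np.zeros((n_rows, n_cols), dtype=data.dtype)

for i in range(n_rows):
    start, stop = indptr[i], indptr[i + 1]
    # Row i gets data[start:stop] at columns indices[start:stop].
    dense[i, indices[start:stop]] = data[start:stop]

print(dense)
# [[1 0 2]
#  [0 0 3]
#  [4 5 6]]
```

The result matches the array shown in the scipy documentation example.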
Hope you get the point.