scipy csr_matrix: understand indptr

Question:

Every once in a while, I get to manipulate a csr_matrix but I always forget how the parameters indices and indptr work together to build a sparse matrix.

I am looking for a clear and intuitive explanation of how indptr interacts with both the data and indices parameters when defining a sparse matrix using the notation csr_matrix((data, indices, indptr), [shape=(M, N)]).

I can see from the scipy documentation that the data parameter contains all the non-zero data, and the indices parameter contains the columns associated to that data (as such, indices is equal to col in the example given in the documentation). But how can we explain in clear terms the indptr parameter?

Asked By: Tanguy


Answers:

Maybe this explanation can help understand the concept:

  • data is an array containing all the non zero elements of the sparse matrix.
  • indices is an array mapping each element in data to its column in the sparse matrix.
  • indptr then maps the elements of data and indices to the rows of the sparse matrix. This is done with the following reasoning:

    1. If the sparse matrix has M rows, indptr is an array containing M+1 elements.
    2. For row i, the slice indptr[i]:indptr[i+1] gives the positions in data and indices that belong to row i. So if indptr[i] = k and indptr[i+1] = l, the values of row i are data[k:l], located at columns indices[k:l]. This is the tricky part, and I hope the following example helps in understanding it.

EDIT: I replaced the numbers in data with letters to avoid confusion in the following example.

[Image: worked example of data, indices and indptr for a small matrix, with letters as the data values]

Note: the values in indptr are necessarily non-decreasing, because each cell in indptr (each row) starts where the previous row's values in data and indices end. Two consecutive equal values mean that the row has no non-zero elements.
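
As a text-only illustration of the slicing rule, here is a minimal sketch in plain Python. The 3x4 matrix and its letters are made up for this illustration and are not taken from the original example; row_entries is a hypothetical helper, not a scipy function.

```python
# Hypothetical 3x4 matrix (letters stand in for non-zero values):
#   [[a, 0, b, 0],
#    [0, 0, 0, 0],
#    [0, c, d, e]]
data = ["a", "b", "c", "d", "e"]
indices = [0, 2, 1, 2, 3]   # column of each entry in data
indptr = [0, 2, 2, 5]       # M + 1 = 4 entries for M = 3 rows


def row_entries(i):
    """Return the (column, value) pairs stored for row i."""
    start, stop = indptr[i], indptr[i + 1]   # k and l in the text
    return list(zip(indices[start:stop], data[start:stop]))


for i in range(len(indptr) - 1):
    print(i, row_entries(i))
# 0 [(0, 'a'), (2, 'b')]
# 1 []
# 2 [(1, 'c'), (2, 'd'), (3, 'e')]
```

Row 1 is empty because indptr[1] and indptr[2] are equal, so the slice data[2:2] contains nothing.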

Answered By: Tanguy

Indeed, the elements of indptr are in ascending (non-decreasing) order.
But how can we explain the behavior of indptr? In short: whenever indptr[i+1] is equal to indptr[i], i.e. the value does not increase, row i of the sparse matrix is empty and can be skipped.

The following example illustrates the above interpretation of indptr elements:

Example 1) imagine this matrix:

array([[0, 1, 0],
       [8, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 7]])


from scipy.sparse import csr_matrix

mat1 = csr_matrix(([1, 8, 7], [1, 0, 2], [0, 1, 2, 2, 2, 3]), shape=(5, 3))
mat1.indptr
# array([0, 1, 2, 2, 2, 3], dtype=int32)
mat1.todense()  # dense form of the sparse matrix; reproduces the values above

Example 2) Converting a dense array to a csr_matrix (the case when the matrix already exists as an array):

import numpy as np
from scipy.sparse import csr_matrix

arr = np.array([[0, 0, 0],
                [8, 0, 0],
                [0, 5, 4],
                [0, 0, 0],
                [0, 0, 7]])


mat2 = csr_matrix(arr)
mat2.indptr
# array([0, 0, 1, 3, 3, 4], dtype=int32)
mat2.indices
# array([0, 1, 2, 2], dtype=int32)
mat2.data
# array([8, 5, 4, 7])
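
The "skip the row when indptr does not increase" reading can be checked with np.diff. This is only a sketch on the array from Example 2; the variable names nnz_per_row and empty_rows are my own.

```python
import numpy as np
from scipy.sparse import csr_matrix

arr = np.array([[0, 0, 0],
                [8, 0, 0],
                [0, 5, 4],
                [0, 0, 0],
                [0, 0, 7]])
mat = csr_matrix(arr)

# Difference of consecutive indptr entries = stored entries per row.
nnz_per_row = np.diff(mat.indptr)
print(nnz_per_row)   # [0 1 2 0 1]

# Rows where indptr does not increase hold no stored entries.
empty_rows = np.flatnonzero(nnz_per_row == 0)
print(empty_rows)    # [0 3]
```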

Answered By: A. Nadjar

indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
csr_matrix((data, indices, indptr), shape=(3, 3)).toarray()
array([[1, 0, 2],
       [0, 0, 3],
       [4, 5, 6]])

The example above is taken from the scipy documentation. In it:

  • The data array contains the non-zero elements present in the sparse matrix traversed row-wise.

  • The indices array gives the column number for each non-zero data point.

  • For example, the first element of data, 1, is in column 0 (indices[0] = 0), the second element, 2, is in column 2 (indices[1] = 2), and so on until the last element of data. The data and indices arrays therefore have the same size.

  • The indptr array gives, for each row, the position in data of that row's first element. Its size is one more than the number of rows; the final entry is the total number of stored elements.

  • For example, the first element of indptr is 0, so row 0 starts at data[0], i.e. 1; the second element of indptr is 2, so row 1 starts at data[2], i.e. 3; and the third element of indptr is 3, so row 2 starts at data[3], i.e. 4.

  • Hope you get the point.
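
To make the three roles concrete, the dense matrix can be rebuilt by hand from the triple. This is a rough sketch of the idea, not how scipy does it internally, and it assumes no column appears twice within a row.

```python
import numpy as np
from scipy.sparse import csr_matrix

indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])

n_rows, n_cols = 3, 3
dense = np.zeros((n_rows, n_cols), dtype=data.dtype)
for i in range(n_rows):
    start, stop = indptr[i], indptr[i + 1]
    # Entries start..stop-1 belong to row i; indices gives their columns.
    dense[i, indices[start:stop]] = data[start:stop]

print(dense)
# [[1 0 2]
#  [0 0 3]
#  [4 5 6]]
```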

Answered By: om belote

Since this is a sparse matrix, the non-zero elements are very few compared to the total number of elements ($m \times n$).

We use:

  • data to store all the non-zero elements, from left to right, top to bottom
  • indices to store the column index of each of these elements
  • indptr[i]:indptr[i+1] as the slice of data (and of indices) that holds all the non-zero elements of row i
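
One consequence of this layout, sketched below on the example from the scipy documentation: expanding indptr with np.repeat recovers an explicit row number for every element of data. The variable rows is my own name for that expanded array.

```python
import numpy as np
from scipy.sparse import csr_matrix

indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
mat = csr_matrix((data, indices, indptr), shape=(3, 3))

# Repeat each row number once per element stored in that row.
rows = np.repeat(np.arange(mat.shape[0]), np.diff(mat.indptr))
print(rows)                       # [0 0 1 2 2 2]

# (rows[k], indices[k]) is the position of data[k] in the dense matrix.
dense = mat.toarray()
print(dense[rows, mat.indices])   # [1 2 3 4 5 6]
```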

Answered By: jp z

In this example:

indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
csr_matrix((data, indices, indptr), shape=(3, 3)).toarray()
array([[1, 0, 2],
       [0, 0, 3],
       [4, 5, 6]])

To read indptr, do this:

  • Ignore indptr[0] = 0; it only marks the start, since no elements come before the first row.
  • indptr[1] = 2 is the number of non-zero data elements up to the end of the first row.
  • indptr[2] = 3 is the number of non-zero data elements from the beginning up to the end of the second row.
  • indptr[3] = 6 is the number of non-zero data elements from the beginning up to the end of the third row.
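
Read this way, indptr is a running total of the per-row non-zero counts with a leading 0. A small sketch on the same matrix, assuming it is available as a dense array (counts is my own variable name):

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[1, 0, 2],
                  [0, 0, 3],
                  [4, 5, 6]])

# Non-zero count of each row, then the running total with a leading 0.
counts = np.count_nonzero(dense, axis=1)            # [2 1 3]
indptr = np.concatenate(([0], np.cumsum(counts)))   # [0 2 3 6]

print(indptr)
print(csr_matrix(dense).indptr)                     # same values
```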

Answered By: bRajat

Think of each value in indptr as the number of non-zero elements already passed before the start of a given row, when reading the original (pre-compressed) matrix row by row. This is a mouthful, but the example below should clarify it.

import numpy as np
from scipy.sparse import csr_matrix

array_for_csr = np.array([[2, 0, 19, 5],
                          [8, 0, 0, 1],
                          [0, 0, 0, 0],
                          [4, 6, 6, 0]])
matrix = csr_matrix(array_for_csr)
print(matrix)
"""
(0, 0)  2
(0, 2)  19
(0, 3)  5
(1, 0)  8
(1, 3)  1
(3, 0)  4
(3, 1)  6
(3, 2)  6
"""
print(matrix.indices)
# [0 2 3 0 3 0 1 2]
print(matrix.indptr)
# [0 3 5 5 8]

For example:

indptr[0] = 0, since 0 non-zero values have been passed before the start of the 1st row of the pre-compressed matrix (we haven't started traversing the matrix yet).

indptr[1] = 3, since 3 non-zero values have been passed before the start of the 2nd row (values 2, 19, 5).

indptr[2] = 5, since 5 non-zero values have been passed before the start of the 3rd row (values 2, 19, 5, 8, 1).

indptr[3] = 5, since 5 non-zero values have been passed before the start of the 4th row (unchanged, because all values in the 3rd row are zero).

indptr[4] = 8, since 8 non-zero values have been passed by the end of the 4th and last row. There is no 5th row, so the last value in indptr is always equal to the number of non-zero values in the pre-compressed matrix.
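
The same traversal can be written out by hand. The loop below is a sketch of the counting idea only, not scipy's actual conversion routine; it builds data, indices and indptr as plain lists from the array used above.

```python
import numpy as np
from scipy.sparse import csr_matrix

array_for_csr = np.array([[2, 0, 19, 5],
                          [8, 0, 0, 1],
                          [0, 0, 0, 0],
                          [4, 6, 6, 0]])

data, indices, indptr = [], [], []
for row in array_for_csr:
    # Number of non-zero values passed before this row starts.
    indptr.append(len(data))
    for col, value in enumerate(row):
        if value != 0:
            data.append(int(value))
            indices.append(col)
# Final entry: total number of non-zero values.
indptr.append(len(data))

print(indptr)    # [0, 3, 5, 5, 8]
print(indices)   # [0, 2, 3, 0, 3, 0, 1, 2]
print(data)      # [2, 19, 5, 8, 1, 4, 6, 6]
```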

Answered By: kylenewm

It’s actually quite simple.

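The example here uses a csc_matrix, the column-oriented counterpart of csr_matrix. There, indptr is a list showing, for each column in turn, the index of the element at which that column begins. For a csr_matrix the same holds with rows in place of columns.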

For example:

import numpy as np
from scipy.sparse import csc_matrix

rows = np.array([0, 0, 1, 2, 2])
cols = np.array([0, 2, 0, 0, 1])
data = np.array([1, 2, 3, 4, 5])
sparse_matrix = csc_matrix((data, (rows, cols)))
sparse_matrix.toarray()
# [[1, 0, 2],
#  [3, 0, 0],
#  [4, 5, 0]]

indptr = sparse_matrix.indptr
# [0, 3, 4, 5]

Here is the secret:

col_data = sparse_matrix.data  # data, column-by-column
# [1, 3, 4, 5, 2]

indptr is a list of indices in col_data where each new column begins.

See for yourself:

  • column 0 starts with element 1 which is at index 0 = indptr[0] in col_data
  • column 1 starts with element 5 which is at index 3 = indptr[1] in col_data
  • column 2 starts with element 2 which is at index 4 = indptr[2] in col_data
  • column 3 would begin at index 5 = indptr[3] in col_data, i.e. right outside of it
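
To connect this back to csr_matrix: as far as I can tell, the CSC arrays of a matrix match the CSR arrays of its transpose, so the column-wise reading above becomes the row-wise reading for CSR. A quick sketch on the same matrix, built here from a dense array for brevity:

```python
import numpy as np
from scipy.sparse import csc_matrix, csr_matrix

dense = np.array([[1, 0, 2],
                  [3, 0, 0],
                  [4, 5, 0]])

csc = csc_matrix(dense)
csr_of_transpose = csr_matrix(dense.T)

print(csc.indptr)                # [0 3 4 5]  (where each column begins)
print(csr_of_transpose.indptr)   # [0 3 4 5]  (where each row of dense.T begins)
print(csr_matrix(dense).indptr)  # [0 2 3 5]  (where each row of dense begins)
```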

Answered By: Vladimir Fokow