How to access sparse matrix elements?
Question:
type(A)
<class 'scipy.sparse.csc.csc_matrix'>
A.shape
(8529, 60877)
print A[0,:]
(0, 25) 1.0
(0, 7422) 1.0
(0, 26062) 1.0
(0, 31804) 1.0
(0, 41602) 1.0
(0, 43791) 1.0
print A[1,:]
(0, 7044) 1.0
(0, 31418) 1.0
(0, 42341) 1.0
(0, 47125) 1.0
(0, 54376) 1.0
print A[:,0]
#nothing returned
Now what I don’t understand is that A[1,:] should select elements from the 2nd row, yet I get elements from the 1st row via print A[1,:]. Also, print A[:,0] should return the first column but I get nothing printed. Why?
Answers:
A[1,:] is itself a sparse matrix with shape (1, 60877). This is what you are printing, and it has only one row, so all the row coordinates are 0.
For example:
In [41]: a = csc_matrix([[1, 0, 0, 0], [0, 0, 10, 11], [0, 0, 0, 99]])
In [42]: a.todense()
Out[42]:
matrix([[ 1, 0, 0, 0],
[ 0, 0, 10, 11],
[ 0, 0, 0, 99]], dtype=int64)
In [43]: print(a[1, :])
(0, 2) 10
(0, 3) 11
In [44]: print(a)
(0, 0) 1
(1, 2) 10
(1, 3) 11
(2, 3) 99
In [45]: print(a[1, :].toarray())
[[ 0 0 10 11]]
You can select columns, but if there are no nonzero elements in the column, nothing is displayed when it is output with print:
In [46]: a[:, 3].toarray()
Out[46]:
array([[ 0],
[11],
[99]])
In [47]: print(a[:,3])
(1, 0) 11
(2, 0) 99
In [48]: a[:, 1].toarray()
Out[48]:
array([[0],
[0],
[0]])
In [49]: print(a[:, 1])
In [50]:
The last print call shows no output because the column a[:, 1] has no nonzero elements.
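If you want to check a slice without relying on what print displays, one option is to look at its shape and its count of stored entries. This is only a small sketch on the same toy matrix; the variable names col_empty and col_full are made up for illustration.

```python
from scipy.sparse import csc_matrix

a = csc_matrix([[1, 0, 0, 0], [0, 0, 10, 11], [0, 0, 0, 99]])

col_empty = a[:, 1]  # column with no stored entries
col_full = a[:, 3]   # column with two stored entries

# A slice stays a 2-D sparse matrix even when nothing is stored in it,
# so shape and nnz tell you what you got even if print shows nothing.
print(col_empty.shape, col_empty.nnz)
print(col_full.shape, col_full.nnz)
```

The empty column is still a (3, 1) matrix; it just has nnz equal to 0.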
To answer your title’s question using a different technique than your question’s details: csc_matrix gives you the method .nonzero().
Given:
>>> import numpy as np
>>> from scipy.sparse import csc_matrix
>>>
>>> row = np.array( [0, 1, 3])
>>> col = np.array( [0, 2, 3])
>>> data = np.array([1, 4, 16])
>>> A = csc_matrix((data, (row, col)), shape=(4, 4))
You can access the indices pointing to non-zero data by:
>>> rows, cols = A.nonzero()
>>> rows
array([0, 1, 3], dtype=int32)
>>> cols
array([0, 2, 3], dtype=int32)
You can then use these indices to access your data, without ever needing to make a dense version of your sparse matrix:
>>> [((i, j), A[i,j]) for i, j in zip(*A.nonzero())]
[((0, 0), 1), ((1, 2), 4), ((3, 3), 16)]
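As a possible variation, converting to COO format exposes the row, column, and data arrays directly, which avoids one A[i, j] lookup per element. This is a sketch on the same toy matrix, not a benchmark; note that it lists every stored entry, so explicitly stored zeros (if any) would show up too.

```python
import numpy as np
from scipy.sparse import csc_matrix

row = np.array([0, 1, 3])
col = np.array([0, 2, 3])
data = np.array([1, 4, 16])
A = csc_matrix((data, (row, col)), shape=(4, 4))

# COO stores three parallel arrays: row index, column index, value.
coo = A.tocoo()
entries = [((int(i), int(j)), int(v))
           for i, j, v in zip(coo.row, coo.col, coo.data)]
print(entries)
```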
If this is for calculating TF-IDF scores using TfidfTransformer, you can get the IDF values from tfidf.idf_ (where tfidf is the fitted transformer). For the sparse result itself, say it is named a, call a.toarray() to get a dense array.
toarray returns an ndarray; todense returns a matrix. If you want a matrix, use todense; otherwise, use toarray.
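A quick way to see the difference is to compare the types and what row indexing gives back. This is a minimal sketch using a csc_matrix like the one in the earlier answers:

```python
import numpy as np
from scipy.sparse import csc_matrix

a = csc_matrix([[1, 0, 0, 0], [0, 0, 10, 11], [0, 0, 0, 99]])

arr = a.toarray()  # plain numpy.ndarray
mat = a.todense()  # numpy.matrix, which always stays 2-D

print(type(arr).__name__, type(mat).__name__)
# Indexing one row drops a dimension for ndarray but not for matrix.
print(arr[1].shape, mat[1].shape)
```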
I fully acknowledge all the other given answers. This is simply a different approach.
To demonstrate this example I am creating a new sparse matrix:
from scipy.sparse import csc_matrix
a = csc_matrix([[1, 0, 0, 0], [0, 0, 10, 11], [0, 0, 0, 99]])
print(a)
Output:
(0, 0) 1
(1, 2) 10
(1, 3) 11
(2, 3) 99
To access this easily, like the way we access a list, I converted it into a list.
temp_list = []
for i in a:
    # each i is a 1 x 4 sparse row; densify it and store it as a plain list
    temp_list.append(i.toarray()[0].tolist())
print(temp_list)
Output:
[[1, 0, 0, 0], [0, 0, 10, 11], [0, 0, 0, 99]]
This might look stupid, since I am creating a sparse matrix and converting it back, but there are some functions like TfidfVectorizer and others that return a sparse matrix as output and handling them can be tricky. This is one way to extract data out of a sparse matrix.
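For what it's worth, the same nested list can probably be had in one call by densifying the whole matrix first. This is a sketch that assumes the matrix is small enough to fit in memory as a dense array; the name nested is made up for illustration.

```python
from scipy.sparse import csc_matrix

a = csc_matrix([[1, 0, 0, 0], [0, 0, 10, 11], [0, 0, 0, 99]])

# toarray() builds the dense ndarray; tolist() turns it into nested Python lists.
nested = a.toarray().tolist()
print(nested)
```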
Coming into this rather late, but for those seeking a method for indexing into elements of a SciPy sparse CSR or CSC matrix: we can put the nonzero row and column indices, together with the matrix's data attribute, into a pandas DataFrame and look the element up there. This simple technique doesn’t require conversion to a dense array.
Let’s create a sparse matrix.
import numpy as np
import pandas as pd
from scipy import stats
from scipy.sparse import csr_matrix, random
from numpy.random import default_rng
rng = default_rng()
rvs = stats.poisson(25, loc=10).rvs
A = random(5, 5, density=0.25, random_state=rng, data_rvs=rvs)
A.toarray()
Output:
array([[32., 0., 32., 0., 0.],
[ 0., 29., 0., 0., 0.],
[ 0., 0., 0., 30., 0.],
[ 0., 0., 37., 30., 0.],
[ 0., 0., 0., 0., 0.]])
The following function takes a sparse CSR or CSC matrix, as well as the row and column indices of the desired nonzero element.
def get_element(matrix, row, col):
    # Assumes matrix.data lines up with nonzero(), i.e. no explicitly stored zeros.
    rows, cols = matrix.nonzero()
    d = {"row": rows, "col": cols, "data": matrix.data}
    df = pd.DataFrame(data=d)
    # Raises IndexError if (row, col) is not a stored nonzero entry.
    element = df[(df["row"] == row) & (df["col"] == col)]["data"].values[0]
    return element
To index into A[3,2]:
get_element(A, row=3,col=2)
Output:
37.0
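For comparison, CSR and CSC matrices also support plain A[i, j] indexing, which returns 0 for positions that are not stored. The sketch below rebuilds the example values by hand (the random matrix above will differ from run to run) and converts to CSR first, since scipy.sparse.random returns COO format by default and COO may not support indexing depending on the SciPy version.

```python
from scipy.sparse import coo_matrix

# Hand-copied values from the example output above (illustrative only).
dense = [
    [32.0, 0.0, 32.0, 0.0, 0.0],
    [0.0, 29.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 30.0, 0.0],
    [0.0, 0.0, 37.0, 30.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
A = coo_matrix(dense)

A_csr = A.tocsr()      # CSR supports scalar indexing
value = A_csr[3, 2]    # a stored entry
missing = A_csr[4, 4]  # not stored, so it reads back as zero
print(value, missing)
```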