Scipy Sparse Row/Column Dot Products
Question:
What is the readable and efficient way to compute the dot product between two columns or rows of a sparse matrix using scipy? Let's say that we want to take the dot product of two vectors x and y, which are two columns of a sparse matrix A. I'm currently doing:
x = A.getcol(i)
y = A.getcol(j)
dot = (x.transpose() * y)[0,0]
A is stored in csc form for efficiency. Is there a more readable way to get the dot product without sacrificing efficiency?
Note: Using Python 2.7.2 with scipy 0.11.0
Answers:
Sparse matrices have a dot method, so you can also go with
dot = x.T.dot(y)[0, 0]
But I personally find your code at least as good as the above.
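On newer versions of Python and scipy (the question's Python 2.7 setup predates this), the @ matrix-multiplication operator reads even more naturally:

dot = (x.T @ y)[0, 0]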
I found it to be somewhat more efficient to use csc_matrix.multiply:
In [77]: %timeit (x.transpose() * y)[0,0]
98.6 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [78]: %timeit x.T.dot(y)[0,0]
133 µs ± 20 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [79]: %timeit x.multiply(y).sum(0)[0, 0]
77.7 µs ± 14.3 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [80]: %timeit x.multiply(y).sum()
86.1 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
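The setup for these timings is not shown above; a minimal stand-in (the matrix size and density here are assumptions for illustration, not the original benchmark's parameters) would be something like:

import numpy as np
from scipy.sparse import random as sparse_random

# Assumed stand-in for the benchmark setup: a 10,000 x 100 CSC matrix at 1% density
A = sparse_random(10_000, 100, density=0.01, format='csc', random_state=0)
x = A.getcol(0)
y = A.getcol(1)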
It might also be worth pointing out that csc_array behaves rather differently from csc_matrix when it comes to these operations: the shapes come out differently, and the (x.transpose() * y)[0, 0] recipe no longer works (for sparse arrays, * means elementwise multiplication rather than matrix multiplication), but np.dot works off the shelf:
In [85]: x2 = csc_array(x)
In [87]: y2 = csc_array(y)
In [108]: %timeit np.dot(x2, y2).sum()
70.9 µs ± 217 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [115]: %timeit x2.T.dot(y2)[0, 0]
103 µs ± 15.2 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [117]: %timeit x2.multiply(y2).sum(0)[0]
70.5 µs ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [120]: %timeit x2.multiply(y2).sum()
59.6 µs ± 162 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
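A quick toy illustration of that operator change (my own sketch, separate from the timings above):

import numpy as np
from scipy.sparse import csc_matrix, csc_array

A = csc_matrix(np.array([[1.0, 2.0], [0.0, 3.0]]))
B = csc_array(A)

print((A.T * A).toarray())  # csc_matrix: * is matrix multiplication
print((B.T * B).toarray())  # csc_array:  * is elementwise multiplication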
And if you end up with rows instead of columns (i.e. csr_array instead of csc_array), then the fastest solution above is no longer the fastest; in fact, it is extremely slow for very large vectors!
In [123]: x3 = csr_array(x.T)
In [124]: y3 = csr_array(y.T)
In [130]: %timeit np.dot(x3, y3).sum()
72.2 µs ± 121 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [134]: %timeit x3.dot(y3.T)[0, 0]
92.6 µs ± 358 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [139]: %timeit x3.multiply(y3).sum(1)[0]
62.7 µs ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [140]: %timeit x3.multiply(y3).sum()
76.4 µs ± 13.4 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
An example with a very large (but very sparse) vector:
In [142]: a = csr_array(([1, 2], [0, 999_999_999], [0, 2]), shape=(1, 1_000_000_000))
In [147]: %time np.dot(a, a).sum()
CPU times: user 205 ms, sys: 926 ms, total: 1.13 s
Wall time: 1.13 s
Out[147]: 5
In [149]: %time a.dot(a.T)[0, 0]
CPU times: user 854 ms, sys: 513 ms, total: 1.37 s
Wall time: 1.37 s
Out[149]: 5
In [151]: %time a.multiply(a).sum(1)[0]
CPU times: user 181 µs, sys: 0 ns, total: 181 µs
Wall time: 182 µs
Out[151]: 5
In [152]: %time a.multiply(a).sum()
CPU times: user 229 ms, sys: 902 ms, total: 1.13 s
Wall time: 1.13 s
Out[152]: 5
If you can use Cython or Numba, it's possible to do it even faster. For Cython, an implementation that calculates the product of two shape (1, n) csr_array with float64 values (and gives me some 10%-20% performance improvement over a.multiply(b).sum(1)[0]) could look something like the following:
cimport cython


@cython.boundscheck(False)
@cython.wraparound(False)
def cython_dot(a, b):
    # a, b: shape (1, n) csr_array with float64 data and sorted int32 indices.
    cdef int na = a.nnz
    cdef int nb = b.nnz
    cdef double[:] a_data_view = a.data
    cdef int[:] a_indices_view = a.indices
    cdef double[:] b_data_view = b.data
    cdef int[:] b_indices_view = b.indices
    cdef int a_ptr = -1
    cdef int b_ptr = -1
    cdef int aiv, biv
    cdef double result = 0
    cdef bint a_update = True
    cdef bint b_update = True
    # Merge-walk the two sorted index arrays, accumulating products
    # wherever the column indices match.
    while True:
        if a_update:
            a_ptr += 1
            if a_ptr == na:
                break
            aiv = a_indices_view[a_ptr]
            a_update = False
        if b_update:
            b_ptr += 1
            if b_ptr == nb:
                break
            biv = b_indices_view[b_ptr]
            b_update = False
        if aiv < biv:
            a_update = True
        elif aiv > biv:
            b_update = True
        else:
            result += a_data_view[a_ptr] * b_data_view[b_ptr]
            a_update = True
            b_update = True
    return result
Here, the line

result += a_data_view[a_ptr] * b_data_view[b_ptr]

could also use libc.math.fma (available in Cython via from libc.math cimport fma) if higher precision is necessary, but this makes it a bit slower in my testing:

result = fma(a_data_view[a_ptr], b_data_view[b_ptr], result)
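To try the function from IPython/Jupyter, the %%cython cell magic is one route; the toy vector below is my own example, not one of the benchmark inputs:

%load_ext cython
# Paste the cython_dot function above (including the cimport cython line)
# into a %%cython cell, then call it on two float64 CSR row vectors:

from scipy.sparse import csr_array

a = csr_array(([1.0, 2.0], [0, 999], [0, 2]), shape=(1, 1_000))
print(cython_dot(a, a))  # 1.0*1.0 + 2.0*2.0 = 5.0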
A fast Numba implementation could be similar and is about as fast in my testing.
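One possible shape for such a Numba version, sketched under the same assumptions (two shape (1, n) csr_array with float64 data and sorted indices; an illustration, not the exact code I benchmarked):

import numba
import numpy as np

@numba.njit(cache=True)
def _sparse_dot_kernel(a_data, a_indices, b_data, b_indices):
    # Merge-style walk over the two sorted index arrays, same idea as
    # the Cython version above.
    result = 0.0
    i, j = 0, 0
    na, nb = len(a_indices), len(b_indices)
    while i < na and j < nb:
        ai, bi = a_indices[i], b_indices[j]
        if ai < bi:
            i += 1
        elif ai > bi:
            j += 1
        else:
            result += a_data[i] * b_data[j]
            i += 1
            j += 1
    return result

def numba_dot(a, b):
    # a, b: shape (1, n) csr_array with float64 data and sorted indices
    return _sparse_dot_kernel(a.data, a.indices, b.data, b.indices)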