Scipy Sparse Row/Column Dot Products

Question:

What is a readable and efficient way to compute the dot product between two columns or rows of a sparse matrix using scipy? Say we want the dot product of two vectors x and y that are columns of a sparse matrix A. Currently I'm doing:

    x = A.getcol(i)
    y = A.getcol(j)
    dot = (x.transpose() * y)[0,0]

A is stored in CSC form for efficiency. Is there a more readable way to get the dot product without sacrificing efficiency?

Note: Using Python 2.7.2 with scipy 0.11.0

Asked By: Jon


Answers:

Sparse matrices have a dot method, so you can also go with

    dot = x.T.dot(y)[0, 0]

But I personally find your code at least as good as the above.
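For a concrete, self-contained illustration (the small matrix here is made up for the example):

```python
import numpy as np
from scipy.sparse import csc_matrix

# toy matrix, invented for this example
A = csc_matrix(np.array([[1., 0., 2.],
                         [0., 3., 0.],
                         [4., 0., 5.]]))

x = A.getcol(0)
y = A.getcol(2)

# (1, n) @ (n, 1) gives a 1x1 sparse matrix; [0, 0] extracts the scalar
dot = x.T.dot(y)[0, 0]
print(dot)  # 1*2 + 4*5 = 22.0
```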

Answered By: Jaime

I found it to be somewhat more efficient to use csc_matrix.multiply:

In [77]: %timeit (x.transpose() * y)[0,0]
98.6 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [78]: %timeit x.T.dot(y)[0,0]
133 µs ± 20 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [79]: %timeit x.multiply(y).sum(0)[0, 0]
77.7 µs ± 14.3 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [80]: %timeit x.multiply(y).sum()
86.1 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

It might also be worth pointing out that csc_array behaves rather differently from csc_matrix in these operations: the result shapes come out differently, transpose no longer behaves the same way, but np.dot works out of the box:

In [85]: x2 = csc_array(x)

In [87]: y2 = csc_array(y)

In [108]: %timeit np.dot(x2, y2).sum()
70.9 µs ± 217 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [115]: %timeit x2.T.dot(y2)[0, 0]
103 µs ± 15.2 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [117]: %timeit x2.multiply(y2).sum(0)[0]
70.5 µs ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [120]: %timeit x2.multiply(y2).sum()
59.6 µs ± 162 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
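As a quick sanity check that the csc_array variants agree with the dense result (the random data here is made up for this check):

```python
import numpy as np
from scipy.sparse import csc_array

rng = np.random.default_rng(0)
dense = rng.random((1000, 2))
dense[dense < 0.9] = 0.0  # roughly 10% density

x2 = csc_array(dense[:, [0]])  # shape (1000, 1) sparse columns
y2 = csc_array(dense[:, [1]])

expected = np.dot(dense[:, 0], dense[:, 1])
sparse_result = x2.multiply(y2).sum()
print(np.isclose(sparse_result, expected))  # True
```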

And if you end up with rows instead of columns — i.e. csr_array instead of csc_array — then the fastest solution above is no longer the fastest (and is in fact extremely slow for very large vectors)!

In [123]: x3 = csr_array(x.T)

In [124]: y3 = csr_array(y.T)

In [130]: %timeit np.dot(x3, y3).sum()
72.2 µs ± 121 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [134]: %timeit x3.dot(y3.T)[0, 0]
92.6 µs ± 358 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [139]: %timeit x3.multiply(y3).sum(1)[0]
62.7 µs ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [140]: %timeit x3.multiply(y3).sum()
76.4 µs ± 13.4 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

An example with a very large (but very sparse) vector:

In [142]: a = csr_array(([1, 2], ([0, 0], [0, 999_999_999])), shape=(1, 1_000_000_000))

In [147]: %time np.dot(a, a).sum()
CPU times: user 205 ms, sys: 926 ms, total: 1.13 s
Wall time: 1.13 s
Out[147]: 5

In [149]: %time a.dot(a.T)[0, 0]
CPU times: user 854 ms, sys: 513 ms, total: 1.37 s
Wall time: 1.37 s
Out[149]: 5

In [151]: %time a.multiply(a).sum(1)[0]
CPU times: user 181 µs, sys: 0 ns, total: 181 µs
Wall time: 182 µs
Out[151]: 5

In [152]: %time a.multiply(a).sum()
CPU times: user 229 ms, sys: 902 ms, total: 1.13 s
Wall time: 1.13 s
Out[152]: 5
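The reason multiply(...).sum(1) stays fast is that it only ever touches the stored entries. A smaller stand-in for the billion-column vector above (the nonzero positions are chosen for illustration):

```python
from scipy.sparse import csr_array

# two nonzeros in a wide (1, N) row; N is a smaller stand-in here
N = 1_000_000
a = csr_array(([1.0, 2.0], ([0, 0], [0, N - 1])), shape=(1, N))

# the elementwise square has only 2 stored values, and the row-wise
# sum works off those stored values rather than all N columns
result = a.multiply(a).sum(axis=1)[0]
print(result)  # 1.0 + 4.0 = 5.0
```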

If you can use Cython or Numba, it's possible to go even faster. For Cython, an implementation that computes the dot product of two shape (1, n) csr_arrays with float64 values, and which gives me a 10%–20% performance improvement over a.multiply(b).sum(1)[0], could look something like the following:

cimport cython


@cython.boundscheck(False)
@cython.wraparound(False)
def cython_dot(a, b):
    cdef int na = a.nnz
    cdef int nb = b.nnz
    cdef double[:] a_data_view = a.data
    cdef int[:] a_indices_view = a.indices
    cdef double[:] b_data_view = b.data
    cdef int[:] b_indices_view = b.indices
    cdef int a_ptr = -1
    cdef int b_ptr = -1
    cdef int aiv, biv
    cdef double result = 0
    cdef bint a_update = True
    cdef bint b_update = True
    # Merge-style walk over the two sorted index arrays: advance whichever
    # pointer is behind, and accumulate the product whenever indices match.
    while True:
        if a_update:
            a_ptr += 1
            if a_ptr == na:  # check bounds before indexing (boundscheck is off)
                break
            aiv = a_indices_view[a_ptr]
            a_update = False
        if b_update:
            b_ptr += 1
            if b_ptr == nb:
                break
            biv = b_indices_view[b_ptr]
            b_update = False
        if aiv < biv:
            a_update = True
        elif aiv > biv:
            b_update = True
        else:
            result += a_data_view[a_ptr] * b_data_view[b_ptr]
            a_update = True
            b_update = True
    return result

Here,

result += a_data_view[a_ptr] * b_data_view[b_ptr]

could also use fma (via from libc.math cimport fma) if higher precision is necessary, but this makes it a bit slower in my testing:

result = fma(a_data_view[a_ptr], b_data_view[b_ptr], result)

A fast Numba implementation could be similar and is about as fast in my testing.
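A sketch of what such a Numba version might look like (the name sparse_dot is made up; it mirrors the merge walk above, and falls back to plain Python if Numba is not installed so the sketch still runs):

```python
import numpy as np

try:
    from numba import njit
except ImportError:          # plain-Python fallback, same logic, just slower
    njit = lambda f: f

@njit
def sparse_dot(a_data, a_indices, b_data, b_indices):
    # Merge-walk two sorted sparse index arrays, accumulating products
    # wherever the column indices coincide.
    i = j = 0
    result = 0.0
    while i < a_indices.shape[0] and j < b_indices.shape[0]:
        if a_indices[i] < b_indices[j]:
            i += 1
        elif a_indices[i] > b_indices[j]:
            j += 1
        else:
            result += a_data[i] * b_data[j]
            i += 1
            j += 1
    return result

# usage on two (1, n) csr_arrays x3 and y3 would be:
#     sparse_dot(x3.data, x3.indices, y3.data, y3.indices)
d = sparse_dot(np.array([1.0, 2.0]), np.array([0, 5]),
               np.array([3.0, 4.0]), np.array([5, 9]))
print(d)  # only index 5 matches: 2.0 * 3.0 = 6.0
```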

Answered By: fuglede