Memory efficient dot product between a sparse matrix and a non-sparse numpy matrix

Question:

I have gone through similar questions that have been asked before (for example [1] [2]). However, none of them is completely relevant to my problem.

I am trying to calculate a dot product between two large matrices and I have some memory constraint that I have to meet.

I have a numpy sparse matrix, which has a shape of (10000, 600000). For example,

from scipy import sparse as sps
x = sps.random(m=10000, n=600000, density=0.1).toarray()

The second numpy matrix has shape (600000, 256) and consists only of the values (-1, 1).

import numpy as np
y = np.random.choice([-1,1], size=(600000, 256))

I need the dot product of x and y with the lowest possible memory requirement. Speed is not the primary concern.

Here is what I have tried so far:

Scipy Sparse Format:

Naturally, I converted the numpy sparse matrix to a scipy csr_matrix. However, the task is still getting killed due to a memory issue. There is no error; the process just gets killed in the terminal.

from scipy import sparse as sps
sparse_x = sps.csr_matrix(x, copy=False)
z = sparse_x.dot(y)
# killed

Decreasing dtype precision + Scipy Sparse Format:

from scipy import sparse as sps

x = x.astype("float16", copy=False)
y = y.astype("int8", copy=False)

sparse_x = sps.csr_matrix(x, copy=False)
z = sparse_x.dot(y)
# Increases the memory requirement for some reason and dies 

np.einsum

Not sure if it helps/works with sparse matrices. I found something interesting in this answer. However, the following doesn’t help either:

z = np.einsum('ij,jk->ik', x, y)
# similar memory requirement as the scipy sparse dot

Suggestions?

If you have any suggestions to improve any of these, please let me know. Further, I am thinking in the following directions:

  1. It would be great if I could get rid of the dot product itself somehow. My second matrix (i.e. y) is randomly generated and contains only [-1, 1]. I am hoping there is a way I could take advantage of this structure.

  2. Maybe dividing the dot product into several small dot products and then aggregating, e.g. something like the sketch below.
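
For reference, here is a rough sketch of what I mean by idea 2 (the chunk count and function name are just placeholders): split the shared dimension into blocks and sum the small partial products.

import numpy as np

def chunked_dot(x, y, n_chunks=10):
    # Accumulate x @ y block by block along the shared dimension
    out = np.zeros((x.shape[0], y.shape[1]), dtype=np.result_type(x, y))
    bounds = np.linspace(0, x.shape[1], n_chunks + 1, dtype=int)
    for start, end in zip(bounds[:-1], bounds[1:]):
        out += x[:, start:end] @ y[start:end, :]  # each partial product is small
    return out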

Asked By: Grayrigel


Answers:

The deepest Python code for M@x, where M is a sparse CSR matrix and x is a dense array, is:

In [28]: M._mul_multivector??
Signature: M._mul_multivector(other)
Docstring: <no docstring>
Source:   
    def _mul_multivector(self, other):
        M, N = self.shape
        n_vecs = other.shape[1]  # number of column vectors

        result = np.zeros((M, n_vecs),
                          dtype=upcast_char(self.dtype.char, other.dtype.char))

        # csr_matvecs or csc_matvecs
        fn = getattr(_sparsetools, self.format + '_matvecs')
        fn(M, N, n_vecs, self.indptr, self.indices, self.data,
           other.ravel(), result.ravel())

        return result

The fn here is compiled code. It gets the attribute arrays of the csr matrix, the dense array as a 1d array, and the preallocated result array – also as a flattened view.

I suspect the memory error occurs when creating result, because fn uses that memory as its workspace and does not try to use any more memory.

I know that the sparse@sparse matrix multiplication is done in two compiled steps. The first determines the dimensions and nnz of the result, and the second actually does the calculation. The python code between the two steps creates the result array, much as is being done here. And memory errors with this @ occur in the result allocation.

Answered By: hpaulj

If you were able to create your arrays without any memory problem (which I could for the prepared example sizes), why not just use iteration if it satisfies your needs? Loops equivalent to the dot product could be a solution, and they can be accelerated with numba:

import numpy as np
import numba as nb

@nb.njit  # ("float64[:, ::1](float64[:, ::1], int32[:, ::1])")
def dot(a, b):
    # Plain triple loop equivalent to a @ b, compiled by Numba
    dot_ = np.zeros((a.shape[0], b.shape[1]))
    for i in range(a.shape[1]):
        for j in range(a.shape[0]):
            for k in range(b.shape[1]):
                dot_[j, k] += a[j, i] * b[i, k]
    return dot_

The result will easily fit in memory for your prepared example, as Warren mentioned in the comments.
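
For reference, a small-scale usage sketch of the jitted loop above (the shapes here are hypothetical placeholders, not the question's sizes):

a = np.random.rand(100, 500)                              # dense stand-in for x
b = np.random.choice([-1, 1], size=(500, 16)).astype(np.float64)
assert np.allclose(dot(a, b), a @ b)                      # matches the regular dot product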

If there is no other way, creating 1D (-1, 1) random arrays of shape (x.shape[1],) in loops (based on the desired y_shape) and dotting each with an x row to fill an empty array of the desired shape (x.shape[0], y_shape) might help satisfy the memory constraint (it is perhaps better to use jitted numba loops instead of np.dot). This is very inefficient in time, although it could be handled in batches to avoid many of the loops, i.e.:

def batch(a, y_shape):
    # Fill the result entry by entry, regenerating a random (-1, 1) vector of
    # length a.shape[1] on the fly so y never has to be stored in memory
    res = np.zeros((a.shape[0], y_shape))
    for row_ in range(a.shape[0]):
        for col_ in range(y_shape):
            res[row_, col_] = np.dot(a[row_, :], np.random.choice([-1, 1], size=a.shape[1]))
    return res

Answered By: Ali_Sh

TL;DR: SciPy consumes significantly more memory than strictly needed here due to temporary arrays, type promotion, and inefficient usage. It is also not very fast. The setup can be optimized to use less memory, and Numba can be used to perform the computation efficiently (in both memory usage and time).


I have a numpy sparse matrix, which has a shape of (10000, 600000). For example,
x = sps.random(m=10000, n=600000, density=0.1).toarray()

sps.random(...) returns a COO sparse matrix, so it is relatively compact in memory. Calling .toarray() on it converts the sparse matrix to a huge dense matrix. Indeed, the resulting dense matrix (x) takes 10000*600000*8 bytes = 44.7 GiB (since the default floating-point type is 64-bit wide).

This alone can cause memory issues. On machines with swap memory or a large RAM (eg. 64 GiB), the program can be much slower, and allocating only a few more GiB of RAM can cause the process to be killed if memory is close to being saturated (eg. by the OOM killer on Linux). Note that Windows compresses data in RAM when the remaining memory is pretty limited (similar to zram on Linux). This works well on arrays having a lot of zeros. However, when additional arrays are allocated later and the dense array is read back from RAM, the OS needs to decompress the data, which needs more space. If the rest of the RAM content cannot be compressed well enough during the decompression, an out-of-memory failure is likely to occur.

Naturally, I converted the numpy sparse matrix to scipy csr_matrix

CSR matrices are encoded as 3 1D arrays:

  • a data array containing the non-zero values of the source matrix;
  • a column index array referencing the column location of each non-zero item;
  • a row pointer array referencing, for each row, the offset of its first item in the two previous arrays.

In practice, the last two arrays are 32-bit integer arrays on Windows and likely 64-bit integer arrays on Linux. Since your input matrix has 10% non-zero values, data should take 10000*600000*8*0.1 = 4.5 GiB, the column index array should take the same size on Linux (since both contain 64-bit values) and 2.2 GiB on Windows, and the row array should take 10000*8 = 78 KiB on Linux and be even smaller on Windows. Thus, the CSR matrix should take about 9 GiB overall on Linux and 6.7 GiB on Windows. This is still pretty big.
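
As a quick sanity check, the same arithmetic in code (the 64-bit index size is an assumption that depends on the platform and on SciPy's choice of index dtype):

nnz = int(10_000 * 600_000 * 0.1)           # 600 million non-zero values
data_gib = nnz * 8 / 2**30                  # ~4.5 GiB of float64 values
indices_gib = nnz * 8 / 2**30               # ~4.5 GiB (about 2.2 GiB with 32-bit indices)
indptr_kib = (10_000 + 1) * 8 / 2**10       # ~78 KiB
print(data_gib + indices_gib)               # ~9 GiB overall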

Using copy=False is not a good idea since the CSR matrix will reference the huge initial array (AFAIK data will reference x). This means x needs to be kept in memory, so the resulting memory footprint is about 44.7 GiB + 2.2~4.5 GiB = 46.9~49.2 GiB. This is not to mention that matrix multiplication involving sparse matrices is generally slower (unless the sparsity factor is very small). It is better to copy the array content with copy=True so as to keep only the non-zero values.
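
One way to check whether the data array actually references the original array is np.shares_memory; here is a small verification sketch:

import numpy as np
from scipy import sparse as sps

x_small = np.random.rand(100, 200)
m = sps.csr_matrix(x_small, copy=False)
print(np.shares_memory(m.data, x_small))  # True would mean the CSR data references x_small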

x.astype("float16", copy=False)

It is not possible to convert an array to one with a different data type without copying it, especially when the item sizes differ in memory. Even if it were, it would make no sense here, since the goal is to reduce the size of the input, that is, to create a new smaller array rather than keep the initial one.
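
For example, astype cannot avoid the copy here because the target data type differs from the source one:

import numpy as np

a = np.zeros(4, dtype=np.float64)
b = a.astype("float16", copy=False)   # a copy is still made because the dtypes differ
print(np.shares_memory(a, b))         # False: a new array was allocated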

# Increases the memory requirement for some reason and dies

There are 2 reasons for this.

Firstly, SciPy does not support the float16 data type for sparse matrices yet. You can see this by checking sparse_x.dtype: it is float32 when x is float16. This is because the float16 data type is not natively supported on most platforms (x86-64 processors mainly support conversions from/to this type). As a result, the data part of the CSR matrix is twice as big as it could be.
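
This can be checked quickly on a small matrix (a verification sketch; the exact behaviour may depend on the SciPy version):

import numpy as np
from scipy import sparse as sps

x16 = np.ones((4, 4), dtype=np.float16)
print(sps.csr_matrix(x16).dtype)  # float32: the float16 values are upcast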

Secondly, NumPy does not internally support binary operations on arrays with different data types. NumPy first converts the inputs of binary operations so that they match. To do this, it follows promotion rules; this is called type promotion. SciPy generally works the same way (especially since it often makes use of NumPy internally). This means the int8 array will likely be implicitly converted into a float32 array in your case, that is, an array 4 times bigger in memory. We need to delve into the implementation of SciPy to understand what is really happening.
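
The promotion rule can be seen directly with NumPy:

import numpy as np

print(np.result_type(np.int8, np.float32))                     # float32
print((np.ones(3, np.float32) * np.ones(3, np.int8)).dtype)    # float32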


Under the hood

Let’s understand what is going on internally in SciPy when we do the matrix multiplication. As pointed out by @hpaulj, sparse_x.dot(y) calls _mul_multivector, which creates the output array and calls csr_matvecs. The latter is a wrapped C++ function calling a template function instantiated by the macro SPTOOLS_CSR_DEFINE_TEMPLATE. This macro is passed to the other macro SPTOOLS_FOR_EACH_INDEX_DATA_TYPE_COMBINATION, which is responsible for generating all the possible instances from a list of predefined supported types. The implementation of csr_matvecs is available here.

Based on this code, we can see that the float16 data type is not supported by SciPy (at least for sparse matrices). We can also see that self and other must have the same type when calling the C++ function csr_matvecs (the type promotion is certainly done before calling the C++ function, in wrapping code like this one). We can also see that the C++ implementation is not particularly optimized, and it is also simple. One can easily write code with the same logic and the same performance using Numba or Cython so as to support your specific compact input type. We can finally see that nothing is allocated in the C++ computing part (only in the Python code and certainly in the C++ wrapper).


Memory efficient implementation

First of all, here is some code to set up the arrays in a compact way:

from scipy import sparse as sps
import numpy as np

sparse_x = sps.random(m=10_000, n=600_000, density=0.1, dtype=np.float32)
sparse_x = sps.csr_matrix(sparse_x) # Efficient conversion

y = np.random.randint(0, 2, size=(600_000, 256), dtype=np.int8)
np.subtract(np.multiply(y, 2, out=y), 1, out=y) # Map the 0/1 values to -1/1 in place (avoids temporaries)

One simple pure-Numpy solution to mitigate the overhead of temporary arrays is to compute the matrix multiplication by chunk. Here is an example computing the matrix band by band:

chunk_count = 4
m, n = sparse_x.shape
p, q = y.shape
assert n == p
result = np.empty((m, q), dtype=np.float16)  # half-precision output halves its footprint
for i in range(chunk_count):
    start, end = m*i//chunk_count, m*(i+1)//chunk_count
    result[start:end,:] = sparse_x[start:end,:] @ y

This is <5% slower on my machine and it takes less memory, since temporary arrays for only one band of the output matrix are created at a time. If you still have memory issues with this code, then please check whether the output matrix can actually fit in RAM by writing to it, e.g. with for l in result: l[:] = np.random.rand(l.size). Indeed, creating an array does not mean the memory space is reserved in RAM (see this post).

A more memory-efficient and faster solution is to use Numba or Cython to do manually what SciPy does, without creating any temporary array. The bad news is that Numba does not support the float16 data type yet, so float32 needs to be used instead. Here is an implementation:

import numba as nb

# Only for CSR sparse matrices
@nb.njit(['(float32[::1], int32[::1], int32[::1], int8[:,::1])', '(float32[::1], int64[::1], int64[::1], int8[:,::1])'], fastmath=True, parallel=True)
def sparse_compute(x_data, x_cols, x_rows, y):
    result = np.empty((x_rows.size-1, y.shape[1]), dtype=np.float32)
    for i in nb.prange(x_rows.size-1):
        line = np.zeros(y.shape[1], dtype=np.float32)
        for j in range(x_rows[i], x_rows[i+1]):
            factor = x_data[j]
            y_line = y[x_cols[j],:]
            for k in range(line.size):
                line[k] += y_line[k] * factor
        for k in range(line.size):
            result[i, k] = line[k]
    return result

z = sparse_compute(sparse_x.data, sparse_x.indices, sparse_x.indptr, y)

This is significantly faster than the previous solution and it also consumes far less memory. Indeed, it consumes only 10 MiB to compute the result (that is 50-200 times less than the initial solution on my machine), and it is about 3.5 times faster than SciPy on a 10-year-old mobile processor with only 2 cores (i7-3520M)!


Generalization on LIL sparse matrices

The above Numba code is only for CSR matrices. LIL matrices are not efficient: they are designed for building matrices quickly, but they are inefficient for computations. The docs say to convert them to CSR/CSC matrices (once created) to do computations. LIL matrices are internally two NumPy arrays of CPython list objects. Lists are inefficient and cannot be optimized much by Numba/Cython/C because of how CPython lists are designed. They also use a lot of memory (COO is certainly better for that). Indeed, on mainstream 64-bit platforms and using CPython, each list item is an object, and a CPython object typically takes 32 bytes, not to mention the 8 bytes taken by the reference in the CPython list. Two objects are needed per non-zero value, so about 80 bytes per non-zero value. As a result, the LIL input matrix is pretty huge.

Since Numba does not really support CPython lists, the lists can be converted to NumPy arrays on the fly so as to perform a relatively fast, memory-efficient computation. Fortunately, this conversion is not so expensive on large matrices. Here is an implementation:

@nb.njit('(float32[::1], int32[::1], int8[:,::1], float32[::1])', fastmath=True)
def nb_compute_lil_row(row_data, row_idx, y, result):
    assert row_data.size == row_idx.size
    assert result.size == y.shape[1]
    for i in range(row_data.size):
        factor = row_data[i]
        y_line = y[row_idx[i],:]
        for k in range(y_line.size):
            result[k] += y_line[k] * factor

def nb_compute_lil(sparse_x, y):
    result = np.zeros((sparse_x.shape[0], y.shape[1]), dtype=np.float32)
    for i in range(len(sparse_x.rows)):
        row_data = np.fromiter(sparse_x.data[i], np.float32)
        row_idx = np.fromiter(sparse_x.rows[i], np.int32)
        nb_compute_lil_row(row_data, row_idx, y, result[i])
    return result

z = nb_compute_lil(sparse_x, y)

This code is twice as slow as SciPy on my machine, but it consumes less memory. The worse performance is due to Numba failing to generate fast SIMD code in this specific case (due to missed optimizations of the internal LLVM JIT compiler).

Answered By: Jérôme Richard