# Save / load scipy sparse csr_matrix in portable data format

## Question:

How do you save/load a scipy sparse `csr_matrix` in a portable format? The scipy sparse matrix is created on Python 3 (Windows 64-bit) to run on Python 2 (Linux 64-bit). Initially, I used pickle (with `protocol=2` and `fix_imports=True`), but this didn’t work going from Python 3.2.2 (Windows 64-bit) to Python 2.7.2 (Windows 32-bit); I got the error:

```
TypeError: ('data type not understood', <built-in function _reconstruct>, (<type 'numpy.ndarray'>, (0,), '[98]')).
```

Next, I tried `numpy.save` and `numpy.load`, as well as `scipy.io.mmwrite()` and `scipy.io.mmread()`, and none of these methods worked either.

## Answers:

Assuming you have scipy on both machines, you can just use `pickle`.

However, be sure to specify a binary protocol when pickling numpy arrays. Otherwise you’ll wind up with a huge file.

At any rate, you should be able to do this:

```
import cPickle as pickle  # Python 2; on Python 3 use `import pickle`
import numpy as np
import scipy.sparse

# Just for testing, let's make a dense array and convert it to a csr_matrix
x = np.random.random((10,10))
x = scipy.sparse.csr_matrix(x)

with open('test_sparse_array.dat', 'wb') as outfile:
    pickle.dump(x, outfile, pickle.HIGHEST_PROTOCOL)
```

You can then load it with:

```
import cPickle as pickle

with open('test_sparse_array.dat', 'rb') as infile:
    x = pickle.load(infile)
```

**edit:** scipy 0.19 now has `scipy.sparse.save_npz` and `scipy.sparse.load_npz`.

```
from scipy import sparse
sparse.save_npz("yourmatrix.npz", your_matrix)
your_matrix_back = sparse.load_npz("yourmatrix.npz")
```

For both functions, the `file` argument may also be a file-like object (i.e. the result of `open`) instead of a filename.
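As a minimal sketch of the file-like usage, the following round-trips a small matrix through an in-memory `io.BytesIO` buffer instead of a file on disk. The matrix contents and variable names are illustrative only, and scipy 0.19 or later is assumed.

```python
import io

import numpy as np
from scipy import sparse

# Small illustrative matrix; any sparse matrix would do.
original = sparse.csr_matrix(np.array([[0, 1, 0], [2, 0, 3]]))

buffer = io.BytesIO()
sparse.save_npz(buffer, original)   # write the .npz archive into the buffer
buffer.seek(0)                      # rewind before reading it back
restored = sparse.load_npz(buffer)

# Compare the dense forms to confirm nothing was lost in the round trip.
roundtrip_ok = np.array_equal(original.toarray(), restored.toarray())
print(restored.format, roundtrip_ok)
```

The same pattern should work with a file opened in binary mode (`'wb'` for saving, `'rb'` for loading).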

Got an answer from the Scipy user group:

A csr_matrix has 3 data attributes that matter: `.data`, `.indices`, and `.indptr`. All are simple ndarrays, so `numpy.save` will work on them. Save the three arrays with `numpy.save` or `numpy.savez`, load them back with `numpy.load`, and then recreate the sparse matrix object with:

```
new_csr = csr_matrix((data, indices, indptr), shape=(M, N))
```

So for example:

```
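import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    # np.savez appends ".npz" to the filename if it is missing
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    # pass the full filename here, including the ".npz" extension
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])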
```
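To make the three arrays concrete, here is a small sketch (the example matrix is made up for illustration) that prints `.data`, `.indices` and `.indptr` for a 4x3 matrix with one empty row, then rebuilds the matrix from them.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0, 0, 5],
                  [7, 0, 0],
                  [0, 0, 0],
                  [0, 8, 9]])
m = csr_matrix(dense)

print(m.data)     # non-zero values, stored row by row
print(m.indices)  # column index of each stored value
print(m.indptr)   # row i owns the slice indptr[i]:indptr[i + 1]

# The three arrays plus the shape are enough to reconstruct the matrix.
rebuilt = csr_matrix((m.data, m.indices, m.indptr), shape=m.shape)
same = np.array_equal(rebuilt.toarray(), dense)
```

Note that the shape has to be stored as well: trailing empty columns cannot be inferred from the three arrays alone.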

Though you write that `scipy.io.mmwrite` and `scipy.io.mmread` don’t work for you, I just want to add how they work. This question is the no. 1 Google hit, so I myself started with `np.savez` and `pickle.dump` before switching to the simple and obvious scipy functions. They work for me and shouldn’t be overlooked by those who haven’t tried them yet.

```
from scipy import sparse, io
m = sparse.csr_matrix([[0,0,0],[1,0,0],[0,1,0]])
m # <3x3 sparse matrix of type '<type 'numpy.int64'>' with 2 stored elements in Compressed Sparse Row format>
io.mmwrite("test.mtx", m)
del m
newm = io.mmread("test.mtx")
newm # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in COOrdinate format>
newm.tocsr() # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in Compressed Sparse Row format>
newm.toarray() # array([[0, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=int32)
```

This is what I used to save a `lil_matrix`.

```
import numpy as np
from scipy.sparse import lil_matrix

def save_sparse_lil(filename, array):
    # use np.savez_compressed(..) for compression
    np.savez(filename, dtype=array.dtype.str, data=array.data,
             rows=array.rows, shape=array.shape)

def load_sparse_lil(filename):
    # .data and .rows are object arrays, so NumPy >= 1.16.3 needs allow_pickle=True
    loader = np.load(filename, allow_pickle=True)
    result = lil_matrix(tuple(loader["shape"]), dtype=str(loader["dtype"]))
    result.data = loader["data"]
    result.rows = loader["rows"]
    return result
```
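For context, a `lil_matrix` keeps its contents in two object arrays of Python lists, `.rows` (column indices per row) and `.data` (values per row), which is why NumPy has to pickle them when saving. A small sketch with made-up values shows the layout:

```python
from scipy.sparse import lil_matrix

m = lil_matrix((3, 4))   # default dtype is float64
m[0, 1] = 5
m[2, 0] = 7
m[2, 3] = 9

# Both attributes are 1-D object arrays holding one list per matrix row.
print(m.rows.dtype, m.data.dtype)

rows_as_lists = [list(r) for r in m.rows]   # column indices of each row
data_as_lists = [list(d) for d in m.data]   # matching values of each row
print(rows_as_lists)
print(data_as_lists)
```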

I must say I found NumPy’s `np.load(..)` to be *very slow*. This is my current solution, which I feel runs much faster:

```
import json

import numpy as np
from scipy.sparse import lil_matrix

def _as_object_array(lists):
    # build a 1-D object array of lists, even when all rows have equal length
    out = np.empty(len(lists), dtype=object)
    for i, row in enumerate(lists):
        out[i] = list(row)
    return out

def lil_matrix_to_dict(myarray):
    # convert NumPy arrays and scalars to plain Python types so json can serialize them
    return {
        "dtype": myarray.dtype.str,
        "shape": myarray.shape,
        "data": [np.asarray(row, dtype=myarray.dtype).tolist() for row in myarray.data],
        "rows": [[int(j) for j in row] for row in myarray.rows],
    }

def lil_matrix_from_dict(mydict):
    result = lil_matrix(tuple(mydict["shape"]), dtype=mydict["dtype"])
    result.data = _as_object_array(mydict["data"])
    result.rows = _as_object_array(mydict["rows"])
    return result

def load_lil_matrix(filename):
    with open(filename, "r", encoding="utf-8") as infile:
        return lil_matrix_from_dict(json.load(infile))

def save_lil_matrix(filename, myarray):
    with open(filename, "w", encoding="utf-8") as outfile:
        json.dump(lil_matrix_to_dict(myarray), outfile)
```

I was asked to send the matrix in a simple and generic format:

```
<x,y,value>
```

I ended up with this:

```
import numpy as np

def save_sparse_matrix(m, filename):
    # write one "row,col,value" line per non-zero entry
    nonZeros = np.array(m.nonzero())
    with open(filename, 'w') as thefile:
        for entry in range(nonZeros.shape[1]):
            row, col = nonZeros[0, entry], nonZeros[1, entry]
            thefile.write("%s,%s,%s\n" % (row, col, m[row, col]))
```
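Reading such a file back is not covered above, so here is a hedged sketch of one way to do it. The helper name `load_sparse_triplets` is hypothetical, and it assumes the caller knows the matrix shape, since the `row,col,value` format does not record it.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import coo_matrix


def load_sparse_triplets(filename, shape):
    # Hypothetical helper: read "row,col,value" lines back into a CSR matrix.
    # ndmin=2 keeps a 2-D result even if the file has a single line.
    triplets = np.loadtxt(filename, delimiter=",", ndmin=2)
    rows = triplets[:, 0].astype(np.int64)
    cols = triplets[:, 1].astype(np.int64)
    return coo_matrix((triplets[:, 2], (rows, cols)), shape=shape).tocsr()


# Demo with a small hand-written file in the same format.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "triplets.csv")
    with open(path, "w") as f:
        f.write("0,2,5.0\n1,0,7.0\n3,1,8.5\n")
    loaded = load_sparse_triplets(path, shape=(4, 3))

print(loaded.toarray())
```

All values come back as floats here; an integer matrix would need an explicit cast after loading.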

Here is a performance comparison of the three most upvoted answers, run in a Jupyter notebook. The input is a 1M x 100K random sparse matrix with density 0.001, containing 100M non-zero values:

```
from scipy.sparse import random
matrix = random(1000000, 100000, density=0.001, format='csr')
matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>
```

`io.mmwrite` / `io.mmread`

```
from scipy import io
%time io.mmwrite('test_io.mtx', matrix)
CPU times: user 4min 37s, sys: 2.37 s, total: 4min 39s
Wall time: 4min 39s
%time matrix = io.mmread('test_io.mtx')
CPU times: user 2min 41s, sys: 1.63 s, total: 2min 43s
Wall time: 2min 43s
matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in COOrdinate format>
Filesize: 3.0G.
```

(note that the format has been changed from csr to coo).

`np.savez` / `np.load`

```
import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    # note that .npz extension is added automatically
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    # here we need to add .npz extension manually
    loader = np.load(filename + '.npz')
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])

%time save_sparse_csr('test_savez', matrix)
CPU times: user 1.26 s, sys: 1.48 s, total: 2.74 s
Wall time: 2.74 s
%time matrix = load_sparse_csr('test_savez')
CPU times: user 1.18 s, sys: 548 ms, total: 1.73 s
Wall time: 1.73 s
matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>
Filesize: 1.1G.
```

`cPickle`

```
import cPickle as pickle

def save_pickle(matrix, filename):
    with open(filename, 'wb') as outfile:
        pickle.dump(matrix, outfile, pickle.HIGHEST_PROTOCOL)

def load_pickle(filename):
    with open(filename, 'rb') as infile:
        matrix = pickle.load(infile)
    return matrix

%time save_pickle(matrix, 'test_pickle.mtx')
CPU times: user 260 ms, sys: 888 ms, total: 1.15 s
Wall time: 1.15 s
%time matrix = load_pickle('test_pickle.mtx')
CPU times: user 376 ms, sys: 988 ms, total: 1.36 s
Wall time: 1.37 s
matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>
Filesize: 1.1G.
```

**Note**: cPickle does not work with very large objects (see this answer).

In my experience, it didn’t work for a 2.7M x 50k matrix with 270M non-zero values. The `np.savez` solution worked well.

## Conclusion

(based on this simple test for CSR matrices)

`cPickle` is the fastest method, but it doesn’t work with very large matrices. `np.savez` is only slightly slower, while `io.mmwrite` is much slower, produces a bigger file and restores the matrix in a different format (COO instead of CSR). So `np.savez` is the winner here.
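For anyone who wants to repeat the comparison on their own machine, here is a scaled-down sketch for Python 3. It is not the benchmark above: the matrix is much smaller, it uses `scipy.sparse.save_npz` in place of the hand-written `np.savez` helper, the standard `pickle` module in place of `cPickle`, and the `timed` helper and file names are made up for illustration. Timings will vary, so no numbers are quoted.

```python
import os
import pickle
import tempfile
import time

import numpy as np
from scipy import io, sparse


def timed(func, *args):
    # Return the function result together with the elapsed wall time.
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def pickle_save(path, m):
    with open(path, "wb") as f:
        pickle.dump(m, f, pickle.HIGHEST_PROTOCOL)


def pickle_load(path):
    with open(path, "rb") as f:
        return pickle.load(f)


matrix = sparse.random(2000, 500, density=0.01, format="csr", random_state=0)

# name -> (file name, save function, load function)
methods = {
    "save_npz": ("m.npz", sparse.save_npz, sparse.load_npz),
    "pickle": ("m.pkl", pickle_save, pickle_load),
    "mmwrite": ("m.mtx", io.mmwrite, io.mmread),
}

results = {}
with tempfile.TemporaryDirectory() as tmpdir:
    for name, (fname, save, load) in methods.items():
        path = os.path.join(tmpdir, fname)
        _, t_save = timed(save, path, matrix)
        loaded, t_load = timed(load, path)
        same = np.allclose(matrix.toarray(), loaded.toarray())
        results[name] = (t_save, t_load, os.path.getsize(path), loaded.format, same)

for name, (t_save, t_load, size, fmt, same) in results.items():
    print(f"{name}: save {t_save:.4f}s, load {t_load:.4f}s, "
          f"{size} bytes, format={fmt}, equal={same}")
```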

Now you can use `scipy.sparse.save_npz`:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.save_npz.html

As of scipy 0.19.0, you can save and load sparse matrices this way:

```
from scipy import sparse
data = sparse.csr_matrix((3, 4))
#Save
sparse.save_npz('data_sparse.npz', data)
#Load
data = sparse.load_npz("data_sparse.npz")
```

*EDIT* Apparently it is simple enough to:

```
def sparse_matrix_tuples(m):
    yield from m.todok().items()
```

This yields `((i, j), value)` tuples, which are easy to serialize and deserialize. I’m not sure how it compares performance-wise with the code below for `csr_matrix`, but it’s definitely simpler. I’m leaving the original answer below as I hope it’s informative.

Adding my two cents: for me, `npz` is not portable, as I can’t use it to export my matrix easily to non-Python clients (e.g. PostgreSQL; glad to be corrected). So I would have liked to get CSV output for the sparse matrix (much like what you would get if you `print()` the sparse matrix). How to achieve this depends on the representation of the sparse matrix. For a CSR matrix, the following code spits out CSV output. You can adapt it for other representations.

```
import numpy as np

def csr_matrix_tuples(m):
    # skip empty rows up front; row i owns the slice indptr[i]:indptr[i + 1]
    for i in np.flatnonzero(np.diff(m.indptr)):
        start_index, end_index = m.indptr[i], m.indptr[i + 1]
        for j, data in zip(m.indices[start_index:end_index], m.data[start_index:end_index]):
            yield (i, j, data)

for i, j, data in csr_matrix_tuples(my_csr_matrix):
    print(i, j, data, sep=',')
```

It’s about 2 times slower than `save_npz` in the current implementation, from what I’ve tested.

This works for me:

```
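import numpy as np
import scipy.sparse as sp

file = 'test_sparse.npz'  # placeholder path; any writable .npz filename works
x = sp.csr_matrix([1,2,3])
y = sp.csr_matrix([2,3,4])
np.savez(file, x=x, y=y)
# the matrices are stored as pickled objects, so NumPy >= 1.16.3 needs allow_pickle=True
npz = np.load(file, allow_pickle=True)
>>> npz['x'].tolist()
<1x3 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>
>>> npz['x'].tolist().toarray()
array([[1, 2, 3]], dtype=int64)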

The trick was to call `.tolist()` to convert the 0-dimensional object array (shape `()`) back to the original object.
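A short sketch of what happens under the hood, assuming a recent NumPy: `np.savez` wraps the sparse matrix in a 0-dimensional object array and pickles it, so loading requires `allow_pickle=True` (NumPy 1.16.3 and later default to `False`). The in-memory buffer is used here only to keep the example self-contained.

```python
import io

import numpy as np
import scipy.sparse as sp

x = sp.csr_matrix([1, 2, 3])

buffer = io.BytesIO()
np.savez(buffer, x=x)          # x is pickled inside a 0-d object array
buffer.seek(0)

# allow_pickle=True executes pickled code on load; only use it with trusted files.
npz = np.load(buffer, allow_pickle=True)
wrapped = npz["x"]
print(wrapped.shape, wrapped.dtype)

restored = wrapped.item()      # same effect as wrapped.tolist() for a 0-d array
print(restored.toarray())
```

Because this relies on pickle, it carries the same portability and security caveats as pickling the matrix directly.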