Save / load scipy sparse csr_matrix in portable data format
Question:
How do you save/load a scipy sparse csr_matrix in a portable format? The scipy sparse matrix is created on Python 3 (Windows 64-bit) to run on Python 2 (Linux 64-bit). Initially, I used pickle (with protocol=2 and fix_imports=True), but this didn’t work going from Python 3.2.2 (Windows 64-bit) to Python 2.7.2 (Windows 32-bit), and I got the error:
TypeError: ('data type not understood', <built-in function _reconstruct>, (<type 'numpy.ndarray'>, (0,), '[98]')).
Next, I tried numpy.save and numpy.load as well as scipy.io.mmwrite() and scipy.io.mmread(), and none of these methods worked either.
Answers:
Assuming you have scipy on both machines, you can just use pickle.
However, be sure to specify a binary protocol when pickling numpy arrays. Otherwise you’ll wind up with a huge file.
At any rate, you should be able to do this:
import cPickle as pickle
import numpy as np
import scipy.sparse
# Just for testing, let's make a dense array and convert it to a csr_matrix
x = np.random.random((10,10))
x = scipy.sparse.csr_matrix(x)
with open('test_sparse_array.dat', 'wb') as outfile:
    pickle.dump(x, outfile, pickle.HIGHEST_PROTOCOL)
You can then load it with:
import cPickle as pickle
with open('test_sparse_array.dat', 'rb') as infile:
    x = pickle.load(infile)
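The snippets above are Python 2 code: Python 3 has no cPickle module, and its pickle.HIGHEST_PROTOCOL is newer than anything Python 2 can read. A minimal sketch of the writing side, assuming Python 3 and pinning protocol 2, is shown below. It only checks an in-memory round trip within one interpreter; whether a different Python, numpy and scipy combination can load the result is not guaranteed (the question reports exactly such a failure).

```python
import io
import pickle

import numpy as np
import scipy.sparse

# Small test matrix; any csr_matrix would do.
x = scipy.sparse.csr_matrix(np.eye(3))

# Protocol 2 is the newest pickle protocol that Python 2 understands.
buffer = io.BytesIO()
pickle.dump(x, buffer, protocol=2)

# Read it back from the same in-memory buffer.
buffer.seek(0)
restored = pickle.load(buffer)

# Two sparse matrices are equal when no element differs.
same = (x != restored).nnz == 0
```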
edit: scipy 0.19 now has scipy.sparse.save_npz and scipy.sparse.load_npz.
from scipy import sparse
sparse.save_npz("yourmatrix.npz", your_matrix)
your_matrix_back = sparse.load_npz("yourmatrix.npz")
For both functions, the file argument may also be a file-like object (i.e. the result of open) instead of a filename.
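As a small illustration of the file-like variant, the sketch below writes to an in-memory io.BytesIO buffer instead of a file on disk. The matrix contents are made up for the example; the only step that is easy to forget is rewinding the buffer before loading.

```python
import io

import numpy as np
from scipy import sparse

# Example matrix (values are arbitrary).
your_matrix = sparse.csr_matrix(np.array([[0, 1, 0], [2, 0, 3]]))

# Any binary file-like object works in place of a filename.
buffer = io.BytesIO()
sparse.save_npz(buffer, your_matrix)

# Rewind so load_npz reads from the start of the buffer.
buffer.seek(0)
your_matrix_back = sparse.load_npz(buffer)
```

The same pattern applies to a file opened with open(path, 'wb') and open(path, 'rb').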
Got an answer from the Scipy user group:
A csr_matrix has 3 data attributes that matter: .data, .indices, and .indptr. All are simple ndarrays, so numpy.save will work on them. Save the three arrays with numpy.save or numpy.savez, load them back with numpy.load, and then recreate the sparse matrix object with:
new_csr = csr_matrix((data, indices, indptr), shape=(M, N))
So for example:
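import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    # np.savez appends '.npz' unless filename already ends with it
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    # np.load does not add the extension, so pass the full name including '.npz'
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])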
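To see the three-array approach end to end, here is a minimal round trip in a temporary directory. The file name matrix.npz and the matrix values are assumptions for the example; the check at the end confirms that nothing was lost.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix

# Example matrix with an empty last row.
original = csr_matrix(np.array([[0, 0, 5], [7, 0, 0], [0, 0, 0]]))

with tempfile.TemporaryDirectory() as tmpdir:
    # Using an explicit '.npz' suffix keeps save and load names identical.
    path = os.path.join(tmpdir, "matrix.npz")
    np.savez(path, data=original.data, indices=original.indices,
             indptr=original.indptr, shape=original.shape)

    # The context manager closes the archive before the directory is removed.
    with np.load(path) as loader:
        restored = csr_matrix(
            (loader["data"], loader["indices"], loader["indptr"]),
            shape=tuple(loader["shape"]),
        )

# No differing elements means the round trip preserved the matrix.
roundtrip_ok = (original != restored).nnz == 0
```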
Though you write that scipy.io.mmwrite and scipy.io.mmread don’t work for you, I just want to add how they work. This question is the no. 1 Google hit, so I myself started with np.savez and pickle.dump before switching to the simple and obvious scipy functions. They work for me and shouldn’t be overlooked by those who haven’t tried them yet.
from scipy import sparse, io
m = sparse.csr_matrix([[0,0,0],[1,0,0],[0,1,0]])
m # <3x3 sparse matrix of type '<type 'numpy.int64'>' with 2 stored elements in Compressed Sparse Row format>
io.mmwrite("test.mtx", m)
del m
newm = io.mmread("test.mtx")
newm # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in COOrdinate format>
newm.tocsr() # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in Compressed Sparse Row format>
newm.toarray() # array([[0, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=int32)
This is what I used to save a lil_matrix.
import numpy as np
from scipy.sparse import lil_matrix

def save_sparse_lil(filename, array):
    # use np.savez_compressed(..) for compression
    np.savez(filename, dtype=array.dtype.str, data=array.data,
             rows=array.rows, shape=array.shape)

def load_sparse_lil(filename):
    # data and rows are object arrays of lists, so recent NumPy versions
    # only load them with allow_pickle=True
    loader = np.load(filename, allow_pickle=True)
    result = lil_matrix(tuple(loader["shape"]), dtype=str(loader["dtype"]))
    result.data = loader["data"]
    result.rows = loader["rows"]
    return result
I must say I found NumPy’s np.load(..) to be very slow. This is my current solution, which I feel runs much faster:
from scipy.sparse import lil_matrix
import json

def lil_matrix_to_dict(myarray):
    # .tolist() turns the object arrays of per-row lists into nested lists for json
    result = {
        "dtype": myarray.dtype.str,
        "shape": myarray.shape,
        "data": myarray.data.tolist(),
        "rows": myarray.rows.tolist()
    }
    return result

def lil_matrix_from_dict(mydict):
    result = lil_matrix(tuple(mydict["shape"]), dtype=mydict["dtype"])
    # assign row by row so every entry stays a Python list, even for ragged rows
    for i, (rows, data) in enumerate(zip(mydict["rows"], mydict["data"])):
        result.rows[i] = rows
        result.data[i] = data
    return result

def load_lil_matrix(filename):
    with open(filename, "r", encoding="utf-8") as infile:
        mydict = json.load(infile)
    return lil_matrix_from_dict(mydict)

def save_lil_matrix(filename, myarray):
    with open(filename, "w", encoding="utf-8") as outfile:
        mydict = lil_matrix_to_dict(myarray)
        json.dump(mydict, outfile)
I was asked to send the matrix in a simple and generic format:
<x,y,value>
I ended up with this:
import numpy as np

def save_sparse_matrix(m, filename):
    nonZeros = np.array(m.nonzero())
    # the with block closes the file even if writing fails
    with open(filename, 'w') as thefile:
        for entry in range(nonZeros.shape[1]):
            thefile.write("%s,%s,%s\n" % (nonZeros[0, entry], nonZeros[1, entry], m[nonZeros[0, entry], nonZeros[1, entry]]))
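For completeness, here is a sketch of reading such x,y,value lines back into a matrix. The function name load_sparse_triplets is hypothetical, values are assumed to parse as floats, and the shape has to be passed in by the caller because the text format above does not record it.

```python
import io

from scipy.sparse import coo_matrix


def load_sparse_triplets(lines, shape):
    """Build a CSR matrix from an iterable of 'x,y,value' text lines."""
    rows, cols, values = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        x, y, value = line.split(",")
        rows.append(int(x))
        cols.append(int(y))
        values.append(float(value))
    # COO is the natural constructor for (row, col, value) triplets.
    return coo_matrix((values, (rows, cols)), shape=shape).tocsr()


# Usage with in-memory text; an open text file works the same way.
text = "0,2,1.5\n1,0,2.0\n"
m = load_sparse_triplets(io.StringIO(text), shape=(2, 3))
```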
Here is a performance comparison of the three most upvoted answers using a Jupyter notebook. The input is a 1M x 100K random sparse matrix with density 0.001, containing 100M non-zero values:
from scipy.sparse import random
matrix = random(1000000, 100000, density=0.001, format='csr')
matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>
io.mmwrite / io.mmread
from scipy import io
%time io.mmwrite('test_io.mtx', matrix)
CPU times: user 4min 37s, sys: 2.37 s, total: 4min 39s
Wall time: 4min 39s
%time matrix = io.mmread('test_io.mtx')
CPU times: user 2min 41s, sys: 1.63 s, total: 2min 43s
Wall time: 2min 43s
matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in COOrdinate format>
Filesize: 3.0G.
(note that the format has been changed from csr to coo).
np.savez / np.load
import numpy as np
from scipy.sparse import csr_matrix
def save_sparse_csr(filename, array):
    # note that .npz extension is added automatically
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)
def load_sparse_csr(filename):
    # here we need to add .npz extension manually
    loader = np.load(filename + '.npz')
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])
%time save_sparse_csr('test_savez', matrix)
CPU times: user 1.26 s, sys: 1.48 s, total: 2.74 s
Wall time: 2.74 s
%time matrix = load_sparse_csr('test_savez')
CPU times: user 1.18 s, sys: 548 ms, total: 1.73 s
Wall time: 1.73 s
matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>
Filesize: 1.1G.
cPickle
import cPickle as pickle
def save_pickle(matrix, filename):
    with open(filename, 'wb') as outfile:
        pickle.dump(matrix, outfile, pickle.HIGHEST_PROTOCOL)
def load_pickle(filename):
    with open(filename, 'rb') as infile:
        matrix = pickle.load(infile)
    return matrix
%time save_pickle(matrix, 'test_pickle.mtx')
CPU times: user 260 ms, sys: 888 ms, total: 1.15 s
Wall time: 1.15 s
%time matrix = load_pickle('test_pickle.mtx')
CPU times: user 376 ms, sys: 988 ms, total: 1.36 s
Wall time: 1.37 s
matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>
Filesize: 1.1G.
Note: cPickle does not work with very large objects (see this answer).
In my experience, it didn’t work for a 2.7M x 50k matrix with 270M non-zero values. The np.savez solution worked well.
Conclusion
(based on this simple test for CSR matrices)
cPickle is the fastest method, but it doesn’t work with very large matrices. np.savez is only slightly slower, while io.mmwrite is much slower, produces a bigger file and restores to the wrong format. So np.savez is the winner here.
Now you can use scipy.sparse.save_npz: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.save_npz.html
As of scipy 0.19.0, you can save and load sparse matrices this way:
from scipy import sparse
data = sparse.csr_matrix((3, 4))
#Save
sparse.save_npz('data_sparse.npz', data)
#Load
data = sparse.load_npz("data_sparse.npz")
EDIT Apparently it is simple enough to:
def sparse_matrix_tuples(m):
    yield from m.todok().items()
This yields ((i, j), value) tuples, which are easy to serialize and deserialize. Not sure how it compares performance-wise with the code below for csr_matrix, but it’s definitely simpler. I’m leaving the original answer below as I hope it’s informative.
Adding my two cents: for me, npz is not portable, as I can’t use it to export my matrix easily to non-Python clients (e.g. PostgreSQL; glad to be corrected). So I would have liked to get CSV output for the sparse matrix (much like what you would get if you print() the sparse matrix). How to achieve this depends on the representation of the sparse matrix. For a CSR matrix, the following code spits out CSV output. You can adapt it for other representations.
import numpy as np

def csr_matrix_tuples(m):
    # visit only non-empty rows; row i owns the entries indptr[i]:indptr[i + 1]
    for i in np.flatnonzero(np.diff(m.indptr)):
        start_index, end_index = m.indptr[i], m.indptr[i + 1]
        for j, data in zip(m.indices[start_index:end_index], m.data[start_index:end_index]):
            yield (i, j, data)

for i, j, data in csr_matrix_tuples(my_csr_matrix):
    print(i, j, data, sep=',')
It’s about 2 times slower than save_npz in the current implementation, from what I’ve tested.
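If walking indptr by hand is not a requirement, a different route to the same CSV output is to convert to COO first, which already stores one (row, column, value) triplet per entry. The sketch below swaps in that technique and uses the standard csv module; the function name write_sparse_csv is hypothetical, and no performance claim is made for it.

```python
import csv
import io

import numpy as np
from scipy.sparse import csr_matrix


def write_sparse_csv(m, outfile):
    """Write one 'row,col,value' line per stored entry of a sparse matrix."""
    coo = m.tocoo()
    writer = csv.writer(outfile)
    # .tolist() converts numpy scalars to plain Python numbers before writing.
    for i, j, value in zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()):
        writer.writerow([i, j, value])


# Example with an empty first row, so row indices start at 1.
my_csr_matrix = csr_matrix(np.array([[0, 0, 0], [0, 4, 0], [9, 0, 0]]))
out = io.StringIO()
write_sparse_csv(my_csr_matrix, out)
csv_text = out.getvalue()
```

When writing to a real file, open it with newline='' as the csv module documentation recommends.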
This works for me:
import numpy as np
import scipy.sparse as sp
x = sp.csr_matrix([1,2,3])
y = sp.csr_matrix([2,3,4])
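np.savez(file, x=x, y=y)
# the matrices are stored as pickled objects, so recent NumPy versions need allow_pickle=True
npz = np.load(file, allow_pickle=True)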
>>> npz['x'].tolist()
<1x3 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>
>>> npz['x'].tolist().toarray()
array([[1, 2, 3]], dtype=int64)
The trick was to call .tolist() to convert the 0-dimensional object array back to the original object.
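To make the wrapping visible, the sketch below saves one matrix into an in-memory buffer (an assumption made so the example needs no file on disk) and inspects what comes back. It uses .item(), which, like .tolist(), unwraps a 0-dimensional array. Note that this route relies on pickle inside the .npz file, so it inherits pickle's portability and security caveats.

```python
import io

import numpy as np
import scipy.sparse as sp

x = sp.csr_matrix([1, 2, 3])

# np.savez wraps the non-array object in a 0-dimensional object array.
buffer = io.BytesIO()
np.savez(buffer, x=x)
buffer.seek(0)

# Object arrays are pickled, so loading them requires allow_pickle=True.
npz = np.load(buffer, allow_pickle=True)
wrapped = npz["x"]

# .item() extracts the single Python object held by the 0-d array.
restored = wrapped.item()
```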