# Loading Matlab sparse matrix saved with -v7.3 (HDF5) into Python and operating on it

## Question:

I’m new to Python, coming from Matlab. I have a large sparse matrix saved in Matlab v7.3 (HDF5) format. So far I’ve found two ways of loading the file, using `h5py` and `tables`. However, operating on the matrix seems to be extremely slow after loading it with either. For example, in Matlab:

```
>> whos
  Name      Size                Bytes  Class     Attributes
  M         11337x133338     77124408  double    sparse
>> tic, sum(M(:)); toc
Elapsed time is 0.086233 seconds.
```

Using tables:

```
t = time.time()
sum(f.root.M.data)
elapsed = time.time() - t
print(elapsed)
# 35.929461956
```

Using h5py:

```
t = time.time()
sum(f["M"]["data"])
elapsed = time.time() - t
print(elapsed)
```

(I gave up waiting …)

[EDIT]

Based on the comments from @bpgergo, I should add that I’ve tried converting the result loaded in by `h5py` (`f`) into a `numpy` array or a `scipy` sparse array in the following two ways:

```
from scipy import sparse
A = sparse.csc_matrix((f["M"]["data"], f["M"]["ir"], f["M"]["jc"]))
```

or

```
data = numpy.asarray(f["M"]["data"])
ir = numpy.asarray(f["M"]["ir"])
jc = numpy.asarray(f["M"]["jc"])
A = sparse.coo_matrix(data, (ir, jc))
```

but both of these operations are extremely slow as well.

Is there something I’m missing here?

## Answers:

You’re missing NumPy; here is a guide for Matlab users.

The final answer for posterity:

```
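import warnings

import tables
from scipy import sparse

def load_sparse_matrix(fname):
    # Suppress UserWarnings raised while opening the file.
    warnings.simplefilter("ignore", UserWarning)
    # open_file is the PyTables 3.x name; older releases called it openFile.
    f = tables.open_file(fname)
    # [...] reads each dataset fully into memory as a NumPy array.
    # data = values, ir = row indices, jc = column pointers (CSC layout).
    M = sparse.csc_matrix((f.root.M.data[...], f.root.M.ir[...], f.root.M.jc[...]))
    f.close()
    return M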
```
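To see why the three datasets map straight onto `csc_matrix`, here is a minimal sketch with a hand-made 3x4 matrix. The toy values are invented for illustration; `data`, `ir`, and `jc` simply mirror the dataset names in the file (values, row indices, and column pointers). Passing `shape` explicitly is an extra precaution on my part, since otherwise the row count is inferred from the largest row index.

```python
import numpy as np
from scipy import sparse

# Toy stand-ins for the datasets in the .mat file (values invented for illustration).
# Dense equivalent:
#   [[1, 0, 0, 4],
#    [0, 0, 3, 0],
#    [2, 0, 0, 5]]
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # nonzero values, column by column
ir = np.array([0, 2, 1, 0, 2])               # row index of each value
jc = np.array([0, 2, 2, 3, 5])               # column j occupies data[jc[j]:jc[j+1]]

n_rows, n_cols = 3, len(jc) - 1
A = sparse.csc_matrix((data, ir, jc), shape=(n_rows, n_cols))

dense = A.toarray()
total = A.sum()  # sums only the stored values
```

This also shows why handing `jc` to `coo_matrix` as column indices doesn't work: `jc` has one entry per column plus one, not one entry per nonzero value.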

Most of your problem is that you’re using Python’s `sum` on what’s effectively a memory-mapped array (i.e. it’s on disk, not in memory).

First off, you’re comparing the time it takes to read things from disk to the time it takes to read things in memory. Load the array into memory first, if you want to compare to what you’re doing in matlab.

Secondly, Python’s builtin `sum` is very inefficient for numpy arrays. (Or, rather, iterating through every item of a numpy array independently is very slow, which is what Python’s builtin `sum` is doing.) Use `numpy.sum(yourarray)` or `yourarray.sum()` instead for numpy arrays.
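A rough way to see the difference on an array that is already in memory is sketched below. The array size is arbitrary and the timings will vary by machine, so treat this as an illustration rather than a benchmark.

```python
import time

import numpy as np

# Arbitrary in-memory test array (size chosen only for illustration).
arr = np.ones(1_000_000)

t = time.time()
builtin_total = sum(arr)   # Python-level loop over a million NumPy scalars
builtin_elapsed = time.time() - t

t = time.time()
numpy_total = arr.sum()    # a single vectorized call in compiled code
numpy_elapsed = time.time() - t

print("builtin sum: %.4f s, arr.sum(): %.4f s" % (builtin_elapsed, numpy_elapsed))
```

Both calls return the same total; only the time taken differs.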

As an example:

(Using `h5py`, because I’m more familiar with it.)

```
import h5py
import numpy as np
f = h5py.File('yourfile.hdf', 'r')
dataset = f['/M/data']
# Load the entire array into memory, like you're doing for matlab...
data = np.empty(dataset.shape, dataset.dtype)
dataset.read_direct(data)
print(data.sum())  # Or alternatively, "np.sum(data)"
```
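Slicing an `h5py` dataset with `[...]` also reads it fully into a NumPy array, so the same idea extends to loading the whole sparse matrix. The sketch below is an `h5py` variant of the loader; the function name `load_sparse_matrix_h5py` and its `name` and `n_rows` parameters are assumptions made for this example, not something from the thread. It also assumes the index arrays may be stored as unsigned integers and casts them to signed ones before handing them to SciPy.

```python
import h5py
import numpy as np
from scipy import sparse

def load_sparse_matrix_h5py(fname, name="M", n_rows=None):
    """Read a sparse matrix stored as HDF5 group `name` into a csc_matrix.

    Hypothetical helper: expects datasets called data, ir, and jc in the group.
    """
    with h5py.File(fname, "r") as f:
        group = f[name]
        # [...] copies each dataset from disk into an in-memory NumPy array.
        data = group["data"][...]
        # Cast index arrays to signed integers (assumed safe for realistic sizes).
        ir = group["ir"][...].astype(np.int64)
        jc = group["jc"][...].astype(np.int64)
    n_cols = len(jc) - 1
    # If n_rows is not given, fall back to the largest row index seen.
    if n_rows is None:
        n_rows = int(ir.max()) + 1 if len(ir) else 0
    return sparse.csc_matrix((data, ir, jc), shape=(n_rows, n_cols))
```

Calling `load_sparse_matrix_h5py('yourfile.hdf').sum()` then operates entirely in memory. If the matrix has empty trailing rows, pass `n_rows` explicitly, since the fallback can only see rows that contain at least one stored value.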