How do I operate on a huge matrix (100000×100000) stored as nested list?

Question:

Circumstances

I have a procedure that constructs a matrix from a given list of values.
The list keeps growing; once it reaches 100 thousand or a million values, it produces a million × million matrix.

In the procedure, I do some add/subtract/divide/multiply operations on the matrix, based on a row, a column, or a single element.

Issues

The matrix is so big that I don't think doing the whole manipulation in memory would work.

Questions

Therefore, my question is:
How should I manipulate this huge matrix and the huge value list?
Where should I store it, and how should I read it, so that I can carry out my operations on the matrix without the computer getting stuck?

Asked By: phoenixbai


Answers:

Have you considered using a dictionary? If the matrix is very sparse, it might be feasible to store it as

matrix = {
 (101, 10213) : "value1",
 (1099, 78933) : "value2"
}
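Arithmetic on such a dict works by iterating over the stored keys only; missing keys are implicitly zero. A minimal sketch (the helper names `sparse_add` and `scale_row` are illustrative, not from any library):

```python
# Element-wise operations on a dict-based sparse matrix.
# Keys are (row, col) tuples; absent keys mean the element is 0.

def sparse_add(a, b):
    """Element-wise sum of two sparse matrices stored as dicts."""
    result = dict(a)
    for key, value in b.items():
        result[key] = result.get(key, 0) + value
    return result

def scale_row(matrix, row, factor):
    """Multiply every stored element of one row by a factor."""
    return {key: (value * factor if key[0] == row else value)
            for key, value in matrix.items()}

a = {(0, 1): 2.0, (5, 3): 1.5}
b = {(0, 1): 3.0, (2, 2): 4.0}
print(sparse_add(a, b))  # {(0, 1): 5.0, (5, 3): 1.5, (2, 2): 4.0}
```

For real workloads, `scipy.sparse` (e.g. `dok_matrix`, which is dictionary-of-keys based) provides the same idea with proper matrix operations.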
Answered By: Maria Zverina

First and foremost, such a matrix would have 10G elements. Considering that any useful operation would then involve 30G elements (two inputs plus one output), each taking 4-8 bytes, you cannot expect to do this at all on a 32-bit computer using any sort of in-memory technique. To solve this, I would (a) use a genuine 64-bit machine, (b) use memory-mapped binary files for storage, and (c) ditch Python.

Update

And as I calculated below, if you have 2 input matrices and 1 output matrix of 100000 × 100000 32-bit float/integer elements, that is 120 GB (not quite GiB, though) of data. Assume that on a home computer you can sustain 100 MB/s of I/O bandwidth, and note that every single element of a matrix must be accessed for any operation, including addition and subtraction. The absolute lower limit is then 120 GB / (100 MB/s) = 1200 seconds, or 20 minutes, for a single matrix operation, even written in C, using the operating system as efficiently as possible, with memory-mapped I/O and so forth. For a million-by-million matrix, each operation takes 100 times as long, i.e. roughly 1.5 days. And since the hard disk is saturated during that time, the computer might be completely unusable.
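The back-of-the-envelope bound above can be reproduced in a few lines (the 100 MB/s figure is the answer's assumption about sustained disk throughput):

```python
# Reproducing the I/O lower bound from the estimate above.
n = 100_000                # matrix dimension
bytes_per_element = 4      # 32-bit float/integer
matrices = 3               # two inputs + one output
total_bytes = matrices * n * n * bytes_per_element  # 1.2e11 bytes
bandwidth = 100 * 10**6    # assumed 100 MB/s sustained disk throughput

seconds = total_bytes / bandwidth
print(total_bytes / 10**9, "GB;", seconds / 60, "minutes")  # 120.0 GB; 20.0 minutes
```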

I suggest using NumPy. It is quite fast at arithmetic operations.
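NumPy also combines with the memory-mapped-file approach above via `np.memmap`: the array lives in a file on disk and pages are loaded on demand, so it can exceed available RAM. A minimal sketch (the shape is kept small here so it runs quickly; the filename is arbitrary):

```python
import numpy as np

# The array is backed by a file on disk; mode="w+" creates it.
shape = (1000, 1000)
m = np.memmap("matrix.dat", dtype=np.float32, mode="w+", shape=shape)

m[:] = 1.0        # initialize every element
m[10, :] *= 2.5   # row-wise operation
m[:, 3] += 0.5    # column-wise operation
m.flush()         # write dirty pages back to the file

print(m[10, 3])   # 3.0  (1.0 * 2.5 + 0.5)
```

At the full 100000 × 100000 size this needs about 40 GB of disk per matrix, and each whole-matrix operation is still bounded by disk bandwidth as estimated above.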

Answered By: cval

Your data cannot be stored as plain in-memory arrays; it is too large. If the matrix is, for instance, a binary matrix, you could look at representations for its storage such as hashing larger blocks of zeros together into the same bucket.
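One way to read the block-hashing idea is to store only the blocks that contain non-zero bits, so all-zero regions cost nothing. A hedged sketch (the block size and helper names are arbitrary choices, not a standard API):

```python
# Store a huge binary matrix as a dict of non-empty blocks.
# All-zero blocks are simply absent, so storage scales with the
# number of non-zero regions, not with the full matrix size.
BLOCK = 1024  # block edge length (arbitrary choice)

blocks = {}   # (block_row, block_col) -> set of (local_row, local_col)

def set_bit(r, c):
    key = (r // BLOCK, c // BLOCK)
    blocks.setdefault(key, set()).add((r % BLOCK, c % BLOCK))

def get_bit(r, c):
    key = (r // BLOCK, c // BLOCK)
    return (r % BLOCK, c % BLOCK) in blocks.get(key, ())

set_bit(99_999, 12_345)
print(get_bit(99_999, 12_345), get_bit(0, 0))  # True False
```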

Answered By: Niklas Rosencrantz