Working with big data in python and numpy, not enough ram, how to save partial results on disc?

Question

I am trying to implement algorithms for 1000-dimensional data with 200k+ datapoints in Python. I want to use numpy, scipy, sklearn, networkx, and other useful libraries. I want to perform operations such as computing the pairwise distance between all of the points and clustering all of the points. I have implemented working algorithms that do what I want with reasonable complexity, but when I try to scale them to all of my data I run out of RAM. Of course I do: the pairwise distance matrix for 200k+ datapoints takes a lot of memory.

Here comes the catch: I would really like to do this on crappy computers with low amounts of RAM.

Is there a feasible way for me to make this work despite the low RAM? A much longer running time is really not a problem, as long as the time requirements don’t go to infinity!

I would like to be able to put my algorithms to work and then come back an hour or five later and not find them stuck because they ran out of RAM! I would like to implement this in Python, and be able to use the numpy, scipy, sklearn, and networkx libraries. I would like to be able to calculate the pairwise distances between all my points, etc.

Is this feasible? And how would I go about it, what can I start to read up on?

Asked By: Ekgren

Answer 1

Using numpy.memmap you can create arrays that are mapped directly onto a file:

import numpy
a = numpy.memmap('test.mymemmap', dtype='float32', mode='w+', shape=(200000,1000))
# here you will see a 762MB file created in your working directory

You can treat it as a conventional array:
a += 1000.
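
Because a memmap can be sliced and assigned like any other array, results can be written to it block by block, so that only one block is in RAM at a time. Below is a minimal sketch for the pairwise distance case from the question. The function name pairwise_to_memmap, the block size and the demo file name are made up for illustration, and the demo uses tiny sizes so it runs quickly:

```python
import os
import tempfile

import numpy as np


def pairwise_to_memmap(points, out_path, block=1000):
    """Write all Euclidean pairwise distances of `points` to a file-backed array."""
    n = points.shape[0]
    dist = np.memmap(out_path, dtype='float32', mode='w+', shape=(n, n))
    for i in range(0, n, block):
        # Work in float64 per block to limit rounding error
        xi = np.asarray(points[i:i + block], dtype='float64')
        xi_sq = np.einsum('ij,ij->i', xi, xi)
        for j in range(0, n, block):
            xj = np.asarray(points[j:j + block], dtype='float64')
            xj_sq = np.einsum('ij,ij->i', xj, xj)
            # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, clipped at 0 against rounding
            d2 = xi_sq[:, None] + xj_sq[None, :] - 2.0 * (xi @ xj.T)
            dist[i:i + block, j:j + block] = np.sqrt(np.maximum(d2, 0.0))
    dist.flush()  # write pending changes to the file
    return dist


# Tiny demo: 50 points in 8 dimensions, blocks of 16 rows
rng = np.random.default_rng(0)
demo_points = rng.random((50, 8)).astype('float32')
demo_path = os.path.join(tempfile.mkdtemp(), 'dist.mymemmap')
demo_dist = pairwise_to_memmap(demo_points, demo_path, block=16)
print(demo_dist.shape)
```

The input points can themselves be a memmap. Keep in mind that the full float32 matrix for 200k points needs 200000 * 200000 * 4 bytes = 160 GB of disk space, so this only moves the problem from RAM to disk.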

It is even possible to map more than one array onto the same file and control it from multiple sources if needed. But I’ve experienced some tricky things here. To open the full array again, you have to “close” the previous one first, using del:

del a    
b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(200000,1000))

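But opening only some part of the array makes it possible to achieve simultaneous control, even while a is still open (i.e. it has not been deleted):

# a is the full-size memmap created above and still open
b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(2,1000))
b[1,5] = 123456.
print(a[1,5])
#123456.0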

Great! a was changed together with b, because both are mapped onto the same file. Call b.flush() if you need to be sure the changes have been written to disk.
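
To check that a value really ends up in the file, you can flush, drop the array and reopen the file read-only. This is only a small sketch; the temporary file name and the tiny shape are chosen for illustration:

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'flush_demo.mymemmap')

a = np.memmap(path, dtype='float32', mode='w+', shape=(4, 1000))
a[1, 5] = 123456.
a.flush()  # write pending changes to the file
del a      # drop the reference to the mapping

# Reopen read-only and read the value back from the file
check = np.memmap(path, dtype='float32', mode='r', shape=(4, 1000))
value_on_disk = float(check[1, 5])
print(value_on_disk)
```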

The other important thing worth mentioning is the offset. Suppose you don’t want the first 2 rows in b, but rows 150000 and 150001.

# the offset is in bytes and must be an integer, hence the integer division
b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(2,1000),
                 offset=150000*1000*32//8)
b[1,2] = 999999.
print(a[150001,2])
#999999.0

Now you can access and update any part of the array in simultaneous operations. Note that the offset is given in bytes, so the item size goes into the calculation: for a ‘float64’ array this example would use 150000*1000*64//8 (integer division, because the offset must be an integer).
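
To avoid hard-coding the byte size, the offset can be derived from the dtype's itemsize. The helper below is a sketch under the assumption that the file holds a single row-major 2-D array; the name open_rows and the small demo shape are made up for illustration:

```python
import os
import tempfile

import numpy as np


def open_rows(path, dtype, n_cols, first_row, n_rows, mode='r+'):
    """Map rows [first_row, first_row + n_rows) of a row-major 2-D array in `path`."""
    itemsize = np.dtype(dtype).itemsize      # 4 for float32, 8 for float64
    offset = first_row * n_cols * itemsize   # integer byte offset of the first row
    return np.memmap(path, dtype=dtype, mode=mode,
                     shape=(n_rows, n_cols), offset=offset)


# Small demo file: 200 rows of 10 float32 values
demo_path = os.path.join(tempfile.mkdtemp(), 'offset_demo.mymemmap')
full = np.memmap(demo_path, dtype='float32', mode='w+', shape=(200, 10))

# Map only rows 150 and 151, then write through the window
window = open_rows(demo_path, 'float32', n_cols=10, first_row=150, n_rows=2)
window[1, 2] = 999999.
window.flush()

# Read the same element through the full-size mapping
seen_through_full = float(full[151, 2])
print(seen_through_full)
```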

Answered By: Saullo G. P. Castro
