Python tools for out-of-core computation/data mining

Question:

I am interested in using Python to mine data sets that are too big to fit in RAM but fit on a single hard drive.

I understand that I can export the data to HDF5 files using PyTables. numexpr also allows for some basic out-of-core computation.
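For concreteness, here is a minimal sketch of the kind of pipeline I mean, with tables.Expr evaluating a numexpr expression block by block (the file name, array names, and expression are made up):

```python
import numpy as np
import tables

# Write a large 1-D array to HDF5 in chunks, never holding it all in RAM.
with tables.open_file("data.h5", mode="w") as f:
    a = f.create_earray(f.root, "a", atom=tables.Float64Atom(), shape=(0,))
    for _ in range(100):
        a.append(np.random.rand(1_000_000))

# Evaluate an expression out of core: tables.Expr compiles it with numexpr
# and streams the operands through memory block by block.
with tables.open_file("data.h5", mode="a") as f:
    a = f.root.a
    out = f.create_earray(f.root, "result", atom=tables.Float64Atom(), shape=(0,))
    expr = tables.Expr("2 * a**2 + 1")
    expr.set_output(out, append_mode=True)
    expr.eval()
```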

What would come next? Mini-batching when possible, and relying on linear algebra results to decompose the computation when mini-batching cannot be used?

Or are there some higher level tools I have missed?

Thanks for any insights,

Asked By: user17375


Answers:

What exactly do you want to do? Can you give an example or two?

numpy.memmap is easy:

Create a memory-map to an array stored in a binary file on disk.
Memory-mapped files are used for accessing small segments of large
files on disk, without reading the entire file into memory. NumPy's
memmaps are array-like objects …

See also numpy+memmap on SO.
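A minimal sketch, assuming a made-up file name and shape: write the file once in slabs, then reopen it read-only and reduce it chunk by chunk:

```python
import numpy as np

# Create the backing file once; it is a plain binary file on disk.
shape = (1_000_000, 50)
step = 100_000
mm = np.memmap("big.dat", dtype="float64", mode="w+", shape=shape)
for start in range(0, shape[0], step):
    # Fill slab by slab so only `step` rows are in RAM at a time.
    mm[start:start + step] = np.random.rand(step, shape[1])
mm.flush()
del mm

# Reopen read-only and compute column sums one slab at a time.
data = np.memmap("big.dat", dtype="float64", mode="r", shape=shape)
col_sums = np.zeros(shape[1])
for start in range(0, shape[0], step):
    col_sums += data[start:start + step].sum(axis=0)
```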

The scikit-learn people are very knowledgeable, but prefer specific questions.

Answered By: denis

I have a similar need to work on datasets that are too big for RAM but not big enough to justify map-reduce. I posed this question on SO when I started to investigate Python pandas as a serious alternative to SAS: "Large data" work flows using pandas

The answer presented there suggests using the HDF5 interface from pandas to store pandas data structures directly on disk. Once stored, you could access the data in batches and train a model incrementally. For example, scikit-learn has several classes that can be trained on incremental pieces of a dataset. One such example is found here:

http://scikit-learn.org/0.13/modules/generated/sklearn.linear_model.SGDClassifier.html

Any class that implements the partial_fit method can be trained incrementally. I am still trying to get a viable workflow for these kinds of problems and would be interested in discussing possible solutions.
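As a rough sketch of what that workflow could look like, assuming the frame was saved with format="table" so it can be read back in chunks (the store path, table name, and "label" column are all hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
classes = np.array([0, 1])  # partial_fit needs the full class list on the first call

# Stream the table out of the HDF5 store in chunks and update the model
# incrementally; only one chunk is ever in memory at a time.
with pd.HDFStore("data.h5", mode="r") as store:
    for chunk in store.select("train", chunksize=50_000):
        X = chunk.drop(columns="label").to_numpy()
        y = chunk["label"].to_numpy()
        clf.partial_fit(X, y, classes=classes)
```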

Answered By: Zelazny7

In sklearn 0.14 (to be released in the coming days) there is a full-fledged example of out-of-core classification of text documents.

I think it could be a great example to start with:

http://scikit-learn.org/dev/auto_examples/applications/plot_out_of_core_classification.html

In the next release we’ll extend this example with more classifiers and add documentation in the user guide.

NB: you can reproduce this example with 0.13 too; all the building blocks were already there.
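The key building block in that example is the stateless HashingVectorizer: it needs no fitted vocabulary, so each minibatch can be vectorized independently. A stripped-down sketch along the same lines (the toy minibatch generator stands in for reading real documents from disk):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hashing replaces the fitted vocabulary, so the vectorizer never needs
# to see the full corpus before transforming a minibatch.
vectorizer = HashingVectorizer(n_features=2**18)
clf = SGDClassifier()
all_classes = [0, 1]

def iter_minibatches():
    # Toy stand-in for a generator streaming (texts, labels) from disk.
    yield ["free money now", "lunch at noon?"], [1, 0]
    yield ["cheap pills", "quarterly report attached"], [1, 0]

for texts, labels in iter_minibatches():
    X = vectorizer.transform(texts)  # sparse matrix, built per minibatch
    clf.partial_fit(X, labels, classes=all_classes)
```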

Answered By: oDDsKooL

The goal of data mining is to gain actionable insights from an organization’s data: businesses decide what they can do based on what they know about the data they have.

Put another way, data mining is all about taking a huge amount of data and extracting insights from it, much like how physical mining extracts a small amount of precious metal from large piles of raw ore.

Data mining, however, uses statistics, code, and machine learning algorithms instead of explosives and smelting. Many of those data mining tools are provided by the Python programming language and its extensive ecosystem of third-party modules.

Read More: https://taazaa.com/python-tools-for-data-mining/

Answered By: TaazaaInc