Processing large amounts of data in Python

Question:

I have been trying to process a good chunk of data (a few GBs), but my personal computer can't do it in a reasonable time span, so I was wondering what options I have. I was using Python's csv.reader, but it was painfully slow even to fetch 200,000 lines. Then I migrated this data to an SQLite database, which retrieved results a bit faster and without using so much memory, but slowness was still a major issue.

So, again… what options do I have to process this data? I was wondering about using Amazon's spot instances, which seem useful for this kind of purpose, but maybe there are other solutions to explore.

Supposing that spot instances are a good option, and considering I have never used them before, I'd like to ask what I can expect from them. Does anyone have experience using them for this kind of thing? If so, what is your workflow? I thought I could find a few blog posts detailing workflows for scientific computing, image processing, or that kind of thing, but I didn't find anything, so if you can explain a bit of that or point out some links, I'd appreciate it.

Thanks in advance.

Asked By: r_31415


Answers:

I would try to use numpy to work with your large datasets locally. Numpy arrays should use less memory compared to csv.reader, and computation times should be much faster when using vectorised numpy functions.
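
For example, here is a minimal sketch of the difference: summing one column with a plain Python loop over csv.reader versus a single vectorised numpy operation. The file name and column index are placeholders, and it assumes a purely numeric CSV without a header.

import csv
import numpy as np

# Plain Python: iterate row by row and accumulate (slow for large files)
total = 0.0
with open('data.csv') as f:
    for row in csv.reader(f):
        total += float(row[2])  # hypothetical numeric column

# Vectorised numpy: load the column once, then operate on the whole array
col = np.loadtxt('data.csv', delimiter=',', usecols=(2,))
total = col.sum()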

However, there may be a memory problem when reading the file: numpy.loadtxt and numpy.genfromtxt also consume a lot of memory when reading files. If this is a problem, some (brand new) alternative parser engines are compared here. According to that post, the new parser in pandas (a library built on top of numpy) seems to be an option.
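
If parsing memory is the bottleneck, one option (not part of the original answer, just a sketch) is to let pandas read the CSV in chunks, so only part of the file is in memory at any time. The chunk size and column name below are hypothetical.

import pandas as pd

# Read the CSV in chunks of 100,000 rows instead of all at once,
# aggregating as we go so memory use stays bounded.
total = 0.0
for chunk in pd.read_csv('data.csv', chunksize=100000):
    total += chunk['value'].sum()  # 'value' is a placeholder column name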

As mentioned in the comments, I would also suggest storing your data in a binary format like HDF5 once you have read your files. Loading the data from an HDF5 file is really fast in my experience (it would be interesting to know how fast it is compared to sqlite in your case). The simplest way I know to save your data as HDF5 is with pandas:

import pandas as pd

# Parse the CSV once (pass whatever parsing options read_csv needs for your file)
data = pd.read_csv(filename)

# Write the DataFrame to an HDF5 store so it can be reloaded quickly later
store = pd.HDFStore('data.h5')
store['mydata'] = data
store.close()

Loading your data back is then as simple as:

import pandas as pd

# Reopen the HDF5 store and read the DataFrame back into memory
store = pd.HDFStore('data.h5')
data = store['mydata']
store.close()

 
Answered By: bmu

If you have to use Python, you can try dumbo, which allows you to run Hadoop programs in Python. It's very easy to start with, and you can then write your own code to do Hadoop streaming to process your big data. Do check its short tutorial: https://github.com/klbostee/dumbo/wiki/Short-tutorial
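
To give a flavour of what that looks like, here is a minimal word-count sketch roughly following the linked short tutorial; the mapper/reducer functions and dumbo.run call are the standard structure from that tutorial, not code from this answer.

def mapper(key, value):
    # 'value' is one line of input; emit each word with a count of 1
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # Sum the counts emitted for each word
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)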

A similar tool from Yelp is mrjob: https://github.com/Yelp/mrjob
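
mrjob's structure is similar; a minimal word-count job might look like the sketch below, which uses mrjob's standard MRJob class and is only an illustration, not code from the original answer.

from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Emit each word in the input line with a count of 1
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # Sum the counts for each word
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()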

Answered By: greeness