Large amounts of data in many text files – how to process?

Question:

I have a large amount of data (a few terabytes), and it is still accumulating. It is contained in many tab-delimited flat text files (each about 30MB). Most of the task involves reading the data and aggregating (summing/averaging plus additional transformations) over observations/rows based on a series of predicate statements, and then saving the output as text, HDF5, or SQLite files, etc. I normally use R for such tasks, but I fear this may be a bit too large. Some candidate solutions are to

  1. write the whole thing in C (or Fortran),
  2. import the files (tables) directly into a relational database and then pull off chunks in R or Python (some of the transformations are not amenable to pure SQL solutions), or
  3. write the whole thing in Python.

Would (3) be a bad idea? I know you can wrap C routines in Python, but in this case, since there isn’t anything computationally prohibitive (e.g., optimization routines that require many iterative calculations), I think I/O may be as much of a bottleneck as the computation itself. Do you have any recommendations on further considerations, or suggestions? Thanks.

Edit: Thanks for your responses. There seem to be conflicting opinions about Hadoop, but in any case I don’t have access to a cluster (though I can use several non-networked machines)…

Asked By: hatmatrix


Answers:

(3) is not necessarily a bad idea: Python makes it easy to process “CSV” files (and, despite the C standing for Comma, tab as a separator is just as easy to handle), and it gets just about as much bandwidth in I/O operations as any other language. As for other recommendations: numpy, besides fast computation (which you may not need, per your statements), provides flexible multi-dimensional arrays that may be quite handy for your tasks; and the standard-library module multiprocessing lets you exploit multiple cores for any task that’s easy to parallelize (important, since just about every machine these days has multiple cores ;-).
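
A minimal sketch of that approach, assuming Python 3 and a made-up column layout and predicate (the question doesn’t give the actual file format): each worker process aggregates one tab-delimited file with the csv module, and the parent merges the per-file results.

    import csv
    import glob
    from collections import defaultdict
    from multiprocessing import Pool


    def aggregate_one(path):
        """Sum and count a hypothetical value column (3) per group key (column 0)."""
        sums, counts = defaultdict(float), defaultdict(int)
        with open(path, newline="") as fh:
            for row in csv.reader(fh, delimiter="\t"):
                if len(row) > 3 and row[2] == "keep":   # hypothetical predicate
                    sums[row[0]] += float(row[3])
                    counts[row[0]] += 1
        return sums, counts


    def merge(results):
        """Combine per-file sums/counts into per-group means."""
        total, n = defaultdict(float), defaultdict(int)
        for sums, counts in results:
            for key in sums:
                total[key] += sums[key]
                n[key] += counts[key]
        return {key: total[key] / n[key] for key in total}


    if __name__ == "__main__":
        files = glob.glob("data/*.txt")          # hypothetical location
        with Pool() as pool:                     # defaults to one worker per core
            means = merge(pool.map(aggregate_one, files))

Since the work is mostly I/O-bound, the number of worker processes that actually helps will depend on how many disks the files are spread across.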

Answered By: Alex Martelli

Have a look at Disco. It is a lightweight distributed MapReduce engine, written in about 2000 lines of Erlang but specifically designed for Python development. It supports not only computing over your data but also storing and replicating it reliably. They’ve just released version 0.3, which includes an indexing and database layer.
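
For a sense of what a Disco job looks like, here is a rough sketch loosely adapted from the word-count example in the Disco tutorial; the input tags, column positions, and exact API details are assumptions and may differ between Disco versions, so check the docs for the release you install.

    # Rough sketch of a Disco MapReduce job (adapted from the tutorial's
    # word-count example); column positions and the input tag are assumptions.
    from disco.core import Job, result_iterator


    def map(line, params):
        fields = line.split("\t")
        if len(fields) > 3:                       # hypothetical layout
            yield fields[0], float(fields[3])


    def reduce(iter, params):
        from disco.util import kvgroup
        for key, values in kvgroup(sorted(iter)):
            values = list(values)
            yield key, sum(values) / len(values)  # per-group mean


    if __name__ == "__main__":
        # Inputs are typically DDFS tags or URLs of data pushed to the cluster.
        job = Job().run(input=["tag://mydata"], map=map, reduce=reduce)
        for key, mean in result_iterator(job.wait(show=True)):
            print(key, mean)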

Answered By: Marcelo Cantos

If you have a cluster of machines, you can parallelize your application using Hadoop MapReduce. Although Hadoop is written in Java, it can run Python too. You can check out the following link for pointers on parallelizing your code: PythonWordCount.
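
As a sketch of how that looks with Hadoop Streaming (not the linked word-count example itself): one Python script acts as both mapper and reducer, with made-up column positions and a made-up predicate standing in for the real layout.

    # streaming_aggregate.py -- used as both mapper and reducer via a mode
    # argument; the column positions and predicate are assumptions.
    import sys


    def mapper():
        # Emit "group_key<TAB>value" for each row that passes a simple predicate.
        for line in sys.stdin:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 3 and fields[2] == "keep":   # hypothetical predicate
                print("{0}\t{1}".format(fields[0], fields[3]))


    def reducer():
        # Streaming delivers mapper output sorted by key, so per-key means can
        # be computed in a single pass.
        current, total, n = None, 0.0, 0
        for line in sys.stdin:
            key, value = line.rstrip("\n").split("\t", 1)
            if current is not None and key != current:
                print("{0}\t{1}".format(current, total / n))
                total, n = 0.0, 0
            current = key
            total += float(value)
            n += 1
        if current is not None:
            print("{0}\t{1}".format(current, total / n))


    if __name__ == "__main__":
        # Invoked roughly as follows (the streaming jar path varies by version):
        #   hadoop jar hadoop-streaming.jar \
        #     -input /data/raw -output /data/aggregated \
        #     -mapper "python streaming_aggregate.py map" \
        #     -reducer "python streaming_aggregate.py reduce" \
        #     -file streaming_aggregate.py
        mapper() if sys.argv[1] == "map" else reducer()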

Answered By: Snehal

Yes, you are right: I/O will cost most of your processing time. I don’t suggest using distributed systems like Hadoop for this task.

Your task could be done on a modest workstation. I am not a Python expert, but I think it has support for asynchronous programming. In F#/.NET, the platform has good support for that. I once did an image-processing job where loading 20K images from disk and transforming them into feature vectors took only several minutes in parallel.

All in all, load and process your data in parallel and save the results in memory (if small) or in a database (if big).
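
For the “save in a database if big” part, here is a minimal sketch using the standard-library sqlite3 module; the table and column names are made up, and means stands for whatever per-group aggregates the parallel step produced.

    import sqlite3


    def save_aggregates(means, db_path="aggregates.db"):
        """Persist a {group_key: mean_value} dict into an SQLite table."""
        con = sqlite3.connect(db_path)
        con.execute("""CREATE TABLE IF NOT EXISTS group_means (
                           group_key  TEXT PRIMARY KEY,
                           mean_value REAL)""")
        con.executemany("INSERT OR REPLACE INTO group_means VALUES (?, ?)",
                        means.items())
        con.commit()
        con.close()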

Answered By: Yin Zhu

With terabytes, you will want to parallelize your reads over many disks anyway, so you might as well go straight to Hadoop.

Use Pig or Hive to query the data; both have extensive support for user-defined transformations, so you should be able to implement what you need to do using custom code.

Answered By: SquareCog

OK, so just to be different, why not R?

  • You seem to know R, so you may get to working code quickly.
  • 30 MB per file is not large on a standard workstation with a few GB of RAM.
  • The read.csv() variant of read.table() can be very efficient if you specify the types of the columns via the colClasses argument: instead of guesstimating types for conversion, they will be handled efficiently.
  • The bottleneck here is I/O from the disk, and that is the same for every language.
  • R has multicore to set up parallel processing on machines with multiple cores (similar to Python’s multiprocessing, it seems).
  • Should you want to exploit the ‘embarrassingly parallel’ structure of the problem, R has several packages that are well suited to data-parallel problems: e.g., snow and foreach can each be deployed on just one machine, or on a set of networked machines.

Answered By: Dirk Eddelbuettel

Since you say the data is “accumulating”, solution (2) looks most suitable for the problem. After the initial load into the database, you only update it with new files (daily, weekly? that depends on how often you need this).

In cases (1) and (3) you need to process the files each time (which, as stated earlier, is the most time- and resource-consuming part), unless you find a way to store the results and update them with new files.

You could use R to process the files from CSV into, for example, an SQLite database.
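
The same idea in Python (since the asker is also considering it), as a minimal sketch with a made-up schema: raw files are loaded into SQLite once, and a small tracking table means later runs only import files that haven’t been seen before.

    import csv
    import glob
    import os
    import sqlite3

    con = sqlite3.connect("observations.db")
    con.execute("CREATE TABLE IF NOT EXISTS observations (group_key TEXT, value REAL)")
    con.execute("CREATE TABLE IF NOT EXISTS loaded_files (name TEXT PRIMARY KEY)")

    already = {row[0] for row in con.execute("SELECT name FROM loaded_files")}
    for path in glob.glob("data/*.txt"):               # hypothetical location
        name = os.path.basename(path)
        if name in already:
            continue                                   # loaded on a previous run
        with open(path, newline="") as fh:
            rows = ((r[0], float(r[3]))                # hypothetical columns
                    for r in csv.reader(fh, delimiter="\t") if len(r) > 3)
            con.executemany("INSERT INTO observations VALUES (?, ?)", rows)
        con.execute("INSERT INTO loaded_files VALUES (?)", (name,))
        con.commit()                                   # one commit per file
    con.close()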

Answered By: Marek

I’ve had good luck using R with Hadoop on Amazon’s Elastic MapReduce. With EMR you pay only for the compute time you use, and AMZN takes care of spinning the instances up and down. Exactly how to structure the job in EMR really depends on how your analysis workflow is structured. For example, are all the records needed for one job contained completely inside each csv, or do you need bits from each csv to complete an analysis?

Here are some resources you might find useful:

The problem I mentioned in my blog post is more one of being CPU-bound, not I/O-bound. Your issues are more about I/O, but the tips on loading libraries and cache files might be useful.

While it’s tempting to try to shove this into and out of a relational database, I recommend carefully considering whether you really need all the overhead of an RDBMS. If you don’t, you may create a bottleneck and a development challenge with no real reward.

Answered By: JD Long