Java vs Python on Hadoop

Question:

I am working on a project using Hadoop, and it seems to natively incorporate Java while providing streaming support for Python. Is there a significant performance impact to choosing one over the other? I am early enough in the process that I can go either way if there is a significant performance difference one way or the other.

Asked By: jnoss


Answers:

Java is less dynamic than Python and more effort has been put into its VM, making it a faster language. Python is also held back by its Global Interpreter Lock, meaning it cannot push the threads of a single process onto different cores.
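
As a rough illustration of what the GIL means in practice, here is a minimal sketch (the worker count and iteration count are arbitrary): a pure-Python, CPU-bound loop gains nothing from extra threads in CPython, but does scale across cores with separate processes. Worth noting for Hadoop specifically: streaming runs each task as its own process, so the GIL matters less there than it would inside a single multi-threaded program.

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def count(n):
    # Pure-Python CPU-bound loop; it holds the GIL the whole time it runs.
    total = 0
    for i in range(n):
        total += i
    return total

def timed(executor_cls, label):
    start = time.time()
    with executor_cls(max_workers=4) as pool:
        list(pool.map(count, [5_000_000] * 4))
    print(f"{label}: {time.time() - start:.2f}s")

if __name__ == "__main__":
    timed(ThreadPoolExecutor, "4 threads (serialized by the GIL)")
    timed(ProcessPoolExecutor, "4 processes (can use separate cores)")
```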

Whether this makes any significant difference depends on what you intend to do. I suspect both languages will work for you.

Answered By: David Crawshaw

With Python you’ll probably develop faster, and with Java it will definitely run faster.

Google “benchmarks game” if you want to see some very accurate speed comparisons between popular languages, but if I recall correctly you’re talking about Java being roughly 3-5x faster than Python.

That said, few things are processor bound these days, so if you feel like you’d develop better with Python, have at it!


In reply to a comment asking how Java can be faster than Python:

All languages are processed differently. Java is about the fastest after C and C++ (which can be as fast as, or up to 5x faster than, Java, but seem to average around 2x faster). The rest are from 2-5+ times slower. Python is one of the faster ones after Java. I’m guessing that C# is about as fast as Java or maybe faster, but the benchmarks game only had Mono (which was a tad slower) because they don’t run it on Windows.

Most of these claims are based on the Computer Language Benchmarks Game, which tends to be pretty fair because advocates of (and experts in) each language tweak the test written in their specific language to ensure the code is well-targeted.

For example, this shows all tests with Java vs. C++: the speed ranges from about equal to Java being 3x slower (the first column is between 1 and 3), and Java uses much more memory!

Now this page shows Java vs. Python (from the point of view of Python). The speeds range from Python being 2x slower than Java to 174x slower; Python generally beats Java in code size and memory usage, though.

Another interesting point here: in tests that allocated a lot of memory, Java actually performed significantly better than Python on memory size as well. I’m pretty sure Java usually loses on memory because of the overhead of the VM, but once that is factored out, Java is probably more efficient than most languages (again, except the Cs).

This is Python 3, by the way; the other Python implementation tested (just called Python) fared much worse.

If you really want to know how it can be faster: the VM is amazingly intelligent. It compiles code to machine language AFTER watching it run, so it knows what the most likely code paths are and optimizes for them. Its memory allocation is an art in itself, which is really useful in an OO language. It can perform some amazing run-time optimizations that no non-VM language can do, it can run in a pretty small memory footprint when forced to, and Java is a language of choice for embedded devices along with C/C++.

I worked on a Signal Analyzer for Agilent (think expensive o-scope) where nearly the entire thing, aside from the sampling, was done in Java. That included drawing the screen, including the trace (in AWT), and handling the controls.

Currently I’m working on a project for all future cable boxes. The Guide, along with most of the other apps, will be written in Java.

Why wouldn’t it be faster than Python?

Answered By: Bill K

You can write Hadoop MapReduce transformations either as “streaming” or as a “custom jar”. If you use streaming, you can write your code in any language you like, including Python or C++; your code just reads from STDIN and writes to STDOUT. However, on Hadoop versions before 0.21, Hadoop streaming only streamed text, not binary, to your processes, so your files needed to be text files unless you did some funky encoding transformations yourself. It appears a patch has since been added that allows the use of binary formats with Hadoop streaming.
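
To make the streaming model concrete, here is a minimal word-count sketch; the script name wordcount.py and the hadoop-streaming jar path in the comment are placeholders, and the exact invocation varies by Hadoop version. The same file acts as either the mapper or the reducer depending on its argument.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming word count. Invocation is roughly (paths/options vary by version):
#   hadoop jar hadoop-streaming.jar -input /in -output /out \
#       -mapper "wordcount.py map" -reducer "wordcount.py reduce" -file wordcount.py
import sys

def mapper():
    # Emit "word<TAB>1" for every word read from STDIN.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Hadoop sorts mapper output by key, so identical words arrive contiguously.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

Since every record crosses the process boundary as a line of text on STDIN/STDOUT, this is exactly where the text-versus-binary overhead mentioned above comes in.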

If you use a “custom jar” (i.e. you write your MapReduce code in Java or Scala using the Hadoop libraries), then you will have access to functions that let you read and write binary data (serialized in binary) from your map and reduce tasks and save the results to disk in that form. So future runs will be much faster (depending on how much smaller your binary format is than your text format).

So if your Hadoop job is going to be I/O bound, the “custom jar” approach will be faster (since Java itself is faster, as previous posters have shown, and reading binary from disk will also be faster).

But you have to ask yourself how valuable your time is. I find myself far more productive with Python, and writing map-reduce code that reads from STDIN and writes to STDOUT is really straightforward. So I personally would recommend going the Python route, even if you have to figure the binary encoding stuff out yourself. Since Hadoop 0.21 handles non-UTF-8 byte arrays, and since there is a binary (byte array) alternative for Python (http://dumbotics.com/2009/02/24/hadoop-1722-and-typed-bytes/) which shows the Python code being only about 25% slower than the “custom jar” Java code, I would definitely go the Python route.

Answered By: John Prior