Matrix Multiplication in Clojure vs Numpy

Question:

I’m working on an application in Clojure that needs to multiply large matrices and am running into some large performance issues compared to an identical Numpy version. Numpy seems to be able to multiply a 1,000,000×23 matrix by its transpose in under a second, while the equivalent clojure code takes over six minutes. (I can print out the resulting matrix from Numpy, so it’s definitely evaluating everything).

Am I doing something terribly wrong in this Clojure code? Is there some trick of Numpy that I can try to mimic?

Here’s the python:

import time
import numpy as np

def test_my_mult(n):
    A = np.random.rand(n*23).reshape(n,23)
    At = A.T

    t0 = time.time()
    res = np.dot(At, A)
    print time.time() - t0
    print np.shape(res)

    return res

# Example (returns a 23x23 matrix):
# >>> results = test_my_mult(1000000)
# 
# 0.906938076019
# (23, 23)

And the clojure:

(defn feature-vec [n]
  (map (partial cons 1)
       (for [x (range n)]
         (take 22 (repeatedly rand)))))

(defn dot-product [x y]
  (reduce + (map * x y)))

(defn transpose
  "returns the transposition of a `coll` of vectors"
  [coll]
  (apply map vector coll))

(defn matrix-mult
  [mat1 mat2]
  (let [row-mult (fn [mat row]
                   (map (partial dot-product row)
                        (transpose mat)))]
    (map (partial row-mult mat2)
         mat1)))

(defn test-my-mult
  [n afn]
  (let [xs  (feature-vec n)
        xst (transpose xs)]
    (time (dorun (afn xst xs)))))

;; Example (yields a 23x23 matrix):
;; (test-my-mult 1000 i/mmult) => "Elapsed time: 32.626 msecs"
;; (test-my-mult 10000 i/mmult) => "Elapsed time: 628.841 msecs"

;; (test-my-mult 1000 matrix-mult) => "Elapsed time: 14.748 msecs"
;; (test-my-mult 10000 matrix-mult) => "Elapsed time: 434.128 msecs"
;; (test-my-mult 1000000 matrix-mult) => "Elapsed time: 375751.999 msecs"


;; Test from wikipedia
;; (def A [[14 9 3] [2 11 15] [0 12 17] [5 2 3]])
;; (def B [[12 25] [9 10] [8 5]])

;; user> (matrix-mult A B)
;; ((273 455) (243 235) (244 205) (102 160))

UPDATE: I implemented the same benchmark using the JBLAS library and found massive, massive speed improvements. Thanks to everyone for their input! Time to wrap this sucker in Clojure. Here’s the new code:

(import '[org.jblas FloatMatrix])

(defn feature-vec [n]
  (FloatMatrix.
   (into-array (for [x (range n)]
                 (float-array (cons 1 (take 22 (repeatedly rand))))))))

(defn test-mult [n]
  (let [xs  (feature-vec n)
        xst (.transpose xs)]
    (time (let [result (.mmul xst xs)]
            [(.rows result)
             (.columns result)]))))

;; user> (test-mult 10000)
;; "Elapsed time: 6.99 msecs"
;; [23 23]

;; user> (test-mult 100000)
;; "Elapsed time: 43.88 msecs"
;; [23 23]

;; user> (test-mult 1000000)
;; "Elapsed time: 383.439 msecs"
;; [23 23]

(defn matrix-stream [rows cols]
  (repeatedly #(FloatMatrix/randn rows cols)))

(defn square-benchmark
  "Times the multiplication of a square matrix."
  [n]
  (let [[a b c] (matrix-stream n n)]
    (time (.mmuli a b c))
    nil))

;; forma.matrix.jblas> (square-benchmark 10)
;; "Elapsed time: 0.113 msecs"
;; nil
;; forma.matrix.jblas> (square-benchmark 100)
;; "Elapsed time: 0.548 msecs"
;; nil
;; forma.matrix.jblas> (square-benchmark 1000)
;; "Elapsed time: 107.555 msecs"
;; nil
;; forma.matrix.jblas> (square-benchmark 2000)
;; "Elapsed time: 793.022 msecs"
;; nil
Asked By: Sam Ritchie


Answers:

The Python version is compiling down to a loop in C, while the Clojure version builds a new intermediate sequence for each of the calls to map in this code. The performance difference you see most likely comes from this difference in data structures.

To do better than this, you could play with a library like Incanter or write your own version as explained in this SO question (see also this one, Neanderthal, or nd4j). If you really want to stay with sequences to keep the lazy-evaluation properties, then you may get a real boost by looking into transients for the internal matrix calculations.

EDIT: I forgot to add the first step in tuning Clojure: turn on "warn on reflection".
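A minimal sketch of both suggestions, assuming you drop down to primitive double arrays (the function name is just illustrative):

;; Warn whenever a call has to fall back on reflection.
(set! *warn-on-reflection* true)

;; Dot product over primitive double arrays: the ^doubles hints and
;; areduce keep the arithmetic unboxed throughout.
(defn dot-product-prim ^double [^doubles xs ^doubles ys]
  (areduce xs i acc 0.0
           (+ acc (* (aget xs i) (aget ys i)))))

;; (dot-product-prim (double-array [1 2 3]) (double-array [4 5 6]))
;; => 32.0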

Answered By: Arthur Ulfeldt

Numpy is highly optimized for linear algebra, certainly for large matrices, where most of the processing happens in native C code.

In order to match this performance (assuming it's possible in Java) you would have to strip away most of Clojure's abstractions: don't use map with anonymous functions when iterating over large matrices, add type hints to enable the use of raw Java arrays, etc.

Probably the best option is just to use a ready-made Java library optimized for numerical computing (http://math.nist.gov/javanumerics/ or similar).

Answered By: nimrodm

I don’t have any specific answers for you; just some suggestions.

  1. Use a profiler to figure out where time is being spent
  2. Set warn-on-reflection and use type hints where needed
  3. You may have to give up some high-level constructs and go with loop-recur to squeeze out that last ounce of performance (see the sketch below)

IME, Clojure code should perform pretty close to Java (within 2 or 3×), but you have to work on it.
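As a rough sketch of point 3, here is the question's dot product rewritten with loop-recur over primitive double arrays (the name dot-loop is hypothetical):

;; No intermediate seqs, no boxed arithmetic: just an indexed loop.
(defn dot-loop ^double [^doubles xs ^doubles ys]
  (let [n (alength xs)]
    (loop [i 0, acc 0.0]
      (if (< i n)
        (recur (inc i) (+ acc (* (aget xs i) (aget ys i))))
        acc))))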

Answered By: PhilM

Numpy code uses built-in libraries, written in Fortran over the last few decades and optimized by the authors, your CPU vendor, and your OS distributor (as well as the Numpy people) for maximal performance. You just did the completely direct, obvious approach to matrix multiplication. It's no surprise, really, that performance differs.

But if you're insistent on doing it in Clojure, consider looking up better algorithms, using direct loops as opposed to higher-order functions like reduce, or finding a proper matrix algebra library for Java (I doubt there are good ones in Clojure, but I don't really know) written by a competent mathematician.

Finally, look up how to properly write fast Clojure. Use type hints, run a profiler on your code (surprise! your dot-product function is using up the most time), and drop the high-level features inside your tight loops.

Answered By: pavpanchekha

Numpy is linking to BLAS/LAPACK routines that have been optimized for decades at the level of machine architecture, while the Clojure code implements the multiplication in the most straightforward and naive manner.

Any time you have non-trivial matrix/vector operations to perform, you should probably link to BLAS/LAPACK.

The only time this won't be faster is for small matrices from languages where the overhead of translating the data representation between the language runtime and LAPACK exceeds the time spent doing the calculation.

Answered By: littleidea

Only use map() if it makes sense, which means: if you have a specific problem like multiplying two matrices, do not try to map() it; just multiply the matrices.

I tend to use map() only when it makes linguistic sense (i.e. if the program is really more readable than without it). Multiplying matrices is so obvious a loop that mapping it makes no sense.

Yours.

Pedro Fortuny.

Answered By: Pedro Fortuny.

As @littleidea and others have pointed out, your numpy version is using LAPACK/BLAS/ATLAS, which will be much faster than anything you do in Clojure, since it has been finely tuned for years. 🙂

That said, the biggest problem with the Clojure code is that it is using Doubles, as in boxed doubles. I call this the "lazy Double" problem and I've run into it at work a number of times. As of now, even with 1.3, Clojure's collections are not primitive-friendly. (You can create a vector of primitives, but it won't help you any, since all of the seq functions will end up boxing them! I should also say that the primitive improvements in 1.3 are quite nice and end up helping; we just aren't 100% there WRT primitive support in collections.)
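To make the boxing cost concrete, here is a small illustrative experiment (hypothetical snippet; timings will vary by machine):

;; Both expressions sum x*x over the same primitive double array, but
;; reduce/map box every element into java.lang.Double along the way,
;; while areduce stays primitive throughout.
(let [xs (double-array 1000000 (repeatedly rand))]
  (time (reduce + (map * xs xs)))
  (time (areduce xs i acc 0.0 (+ acc (* (aget xs i) (aget xs i))))))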

When doing any kind of matrix math in Clojure you really need to use Java arrays or, better yet, matrix libraries. Incanter does use Parallel Colt, but you need to be careful about which Incanter functions you use, since a lot of them make the matrices seqable, which ends up boxing the doubles and giving you performance similar to what you are currently seeing. (BTW, I have my own set of Parallel Colt wrappers that I could release if you think they would be helpful.)

In order to use the BLAS libraries you have a couple of options in java-land. With all of these options you have to pay a JNA tax: all of your data has to be copied before it can be processed. This tax makes sense when you are doing CPU-bound operations like matrix decompositions whose processing takes longer than the time it takes to copy the data. For simpler operations with small matrices, staying in java-land will probably be faster. You just need to do a few tests, like you've done above, to see what works best for you.

Here are your options for using BLAS from java:

http://jblas.org/

http://code.google.com/p/netlib-java/

I should point out that Parallel Colt uses the netlib-java project, which means, I believe, that if you set it up correctly it will use BLAS. However, I have not verified this. For an explanation of the differences between jBLAS and netlib-java, see this thread I started about it on jBLAS's mailing list:

http://groups.google.com/group/jblas-users/browse_thread/thread/c9b3867572331aa5

I should also point out the Universal Java Matrix Package library:

http://sourceforge.net/projects/ujmp/

It wraps all of the libraries I have mentioned, and then some! I haven't looked at the API enough to know how leaky their abstraction is, but it seems like a nice project. I've ended up using my own Parallel Colt Clojure wrappers since they were fast enough, and I actually quite like the Colt API. (Colt uses function objects, which means I was able to just pass Clojure functions with little trouble!)

Answered By: Ben Mabey

If you want to do numerics in Clojure, I’d strongly recommend using Incanter rather than trying to roll your own matrix functions and suchlike.

Incanter uses Parallel Colt under the hood, which is pretty fast.

EDIT:

As of early 2013, if you want to do numerics in Clojure, I strongly recommend checking out core.matrix.
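A minimal core.matrix sketch (assuming the vectorz-clj backend is on the classpath; the data here is just an example):

(require '[clojure.core.matrix :as m])

;; core.matrix is pluggable; pick a fast pure-JVM implementation.
(m/set-current-implementation :vectorz)

(let [a (m/matrix [[1 2] [3 4]])]
  (m/mmul (m/transpose a) a))
;; => [[10 14] [14 20]]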

Answered By: mikera

I’ve just staged a small shootout between Incanter 1.3 and jBLAS 1.2.1. Here’s the code:

(ns ml-class.experiments.mmult
  [:use [incanter core]]
  [:import [org.jblas DoubleMatrix]])

(defn -main [m]
  (let [n 23 m (Integer/parseInt m)
        ai (matrix (vec (double-array (* m n) (repeatedly rand))) n)
        ab (DoubleMatrix/rand m n)
        ti (copy (trans ai))
        tb (.transpose ab)]
    (dotimes [i 20]
      (print "Incanter: ") (time (mmult ti ai))
      (print "   jBLAS: ") (time (.mmul tb ab)))))

In my test, Incanter is consistently slower than jBLAS by about 45% in plain matrix multiplication. However, Incanter's trans function does not create a new copy of the matrix, so (.mmul (.transpose ab) ab) in jBLAS takes twice as much memory and is only 15% faster than (mmult (trans ai) ai) in Incanter.

Given Incanter's rich feature set (especially its plotting library), I don't think I'll switch to jBLAS any time soon. Still, I would love to see another shootout between jBLAS and Parallel Colt, and maybe it's worth considering replacing Parallel Colt with jBLAS in Incanter? 🙂


EDIT: Here are absolute numbers (in msec.) I got on my (rather slow) PC:

Incanter: 665.362452
   jBLAS: 459.311598
   numpy: 353.777885

For each library, I’ve picked the best time out of 20 runs, matrix size 23×400000.

PS. Haskell hmatrix results are close to numpy, but I am not sure how to benchmark it correctly.

Answered By: Sergiy Matusevych