Speedy Matrix Multiplication in Lisp, Again… June 30th, 2009
Patrick Stein

In response to my earlier post about hidden memory allocations, Jason Cornez pointed out how to rework my loop so that Allegro wouldn’t need to allocate memory. The new structure avoids the summing keyword in (loop …) and doesn’t use the return value from (loop …).

(declaim (ftype (function ((simple-array single-float (12))
                           (simple-array single-float (3))
                           (simple-array single-float (3)))
                          (simple-array single-float (3))) mvl*-na))
(defun mvl*-na (matrix vec ret)
  (declare (type (simple-array single-float (12)) matrix)
           (type (simple-array single-float (3)) vec)
           (type (simple-array single-float (3)) ret)
           (optimize (speed 3) (safety 0)))
  (loop for jj fixnum from 0 below 3
     do (let ((offset (* jj 4)))
          (declare (type fixnum offset))
          (let ((current (aref matrix (+ offset 3))))
            (declare (type single-float current))
            (loop for ii fixnum from 0 below 3
               for kk fixnum from offset below (+ offset 3)
               do (incf current
                        (* (aref vec ii)
                           (aref matrix kk))))
            (setf (aref ret jj) current))))
  ret)

With this version, Allegro moved up in the rankings. All of the implementations (except 32-bit Clozure) ran slightly faster. Allegro improved the most, though, as its allocation and garbage collection dropped to nothing.

                           wall      non-GC    alloced
SBCL 1.0.29                0.406s    0.406s    0
CMUCL 19f                  0.539s    0.539s    32
Allegro 8.1 (Personal)     0.720s    0.720s    0
Clozure-64bit 1.3-r11936   1.100s    1.100s    0 (??)
Clozure-32bit 1.3-r11936   6.884s    5.023s    1,680,000,000
LispWorks 5.1 (Personal)   12.587s   ??        3,120,000,516
ECL 9.6.2                  30.875s   ??        17,280,000,256
GNU CLISP 2.47             88.008s   74.455s   2,160,000,000

CMUCL’s compiler emitted 30 notes. I should look into those some time. It still ran quite fast, though.

Allegro definitely liked this version better: it avoided allocating anything at all.

For comparison, I am including the previous runtimes here as well:

                           wall      non-GC    alloced
SBCL 1.0.29                0.444s    0.444s    0
CMUCL 19f                  0.567s    0.567s    0
Clozure-64bit 1.3-r11936   1.272s    1.272s    0 (??)
Clozure-32bit 1.3-r11936   5.009s    4.418s    1,200,000,000
Allegro 8.1 (Personal)     6.131s    2.120s    1,440,000,000
LispWorks 5.1 (Personal)   14.054s   ??        3,360,000,480
ECL 9.6.2                  33.009s   ??        18,240,000,256
GNU CLISP 2.47             93.190s   77.356s   2,520,000,000

To CLOS or not to CLOS June 29th, 2009
Patrick Stein

I am working on some Lisp code. I am trying to mimic the basic structure of a large C++ project. I think the way the C++ project is structured is a good fit for the tasks involved.

Most of the C++ stuff is done with classes. Most of the methods of those classes are virtual methods. Many of the methods will be called multiple times every hundredth of a second. Very few of the virtual methods ever get overridden by subclasses.

So, I got to thinking. Does it make sense for me to use CLOS stuff at all for most of this? Would it be significantly faster to use (defstruct …) and (defun …) instead of (defclass …) and (defgeneric …)?

My gut instinct was: Yes. My first set of test code didn’t bear that out. For both the classes and the structs, I used (MAKE-INSTANCE …) to allocate and (WITH-SLOTS …) to access. When doing so, the classes with generic functions took about 1.5 seconds for 10,000,000 iterations under SBCL while the structs with non-generic functions took about 2.2 seconds.
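For reference, here is a minimal sketch of the shape of that first test. The class, struct, and function names are hypothetical stand-ins, not the original test code, and the struct is allocated with its DEFSTRUCT constructor rather than MAKE-INSTANCE for portability:

```lisp
;; Sketch: a class accessed through a generic function versus a struct
;; accessed through a plain function, both reading slots via WITH-SLOTS.
;; (SLOT-VALUE on structs works in SBCL, though it is not guaranteed
;; portable.)  All names here are illustrative.
(defclass point-class ()
  ((x :initform (random 1.0f0) :type single-float)
   (y :initform (random 1.0f0) :type single-float)))

(defstruct point-struct
  (x (random 1.0f0) :type single-float)
  (y (random 1.0f0) :type single-float))

(defgeneric norm2-generic (p))
(defmethod norm2-generic ((p point-class))
  (with-slots (x y) p
    (+ (* x x) (* y y))))

(defun norm2-plain (p)
  (with-slots (x y) p
    (+ (* x x) (* y y))))

;; Time ten million calls of each flavor.
(let ((cc (make-instance 'point-class))
      (ss (make-point-struct)))
  (time (loop repeat 10000000 do (norm2-generic cc)))
  (time (loop repeat 10000000 do (norm2-plain ss))))
```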

For the next iteration of testing, I decided to use accessors instead of slots. I had the following defined:

(defclass base-class ()
  ((slot-one :initform (random 1.0f0) :type single-float
             :accessor class-slot-one)
   (slot-two :initform (random 1.0f0) :type single-float
             :accessor class-slot-two)))

(defclass sub-class (base-class)
  ((slot-three :initform (random 1.0f0) :type single-float
               :accessor class-slot-three)
   (slot-four  :initform (random 1.0f0) :type single-float
               :accessor class-slot-four)))

(defstruct (base-struct)
  (slot-one (random 1.0f0) :type single-float)
  (slot-two (random 1.0f0) :type single-float))

(defstruct (sub-struct (:include base-struct))
  (slot-three (random 1.0f0) :type single-float)
  (slot-four (random 1.0f0) :type single-float))

I then switched from using calls like (slot-value instance 'slot-three) in my functions to calls like (class-slot-three instance) or (sub-struct-slot-three instance). Now, 10,000,000 iterations took 2.6 seconds for the classes and 0.3 seconds for the structs in SBCL. In Clozure (64-bit), 10,000,000 iterations took 11.0 seconds with classes, method dispatch, and accessors; 6.1 seconds with classes and accessors but no method dispatch; and 0.4 seconds with structs and accessors.

There is much more that I can explore to tune this code. But, for now, I think the answer is fairly easy. I am going to use structs, functions, and accessors for as many cases as I can. Here is the generic-function, class, struct test code with accessors. The first loop uses classes and generic functions. The second loop uses classes and regular functions. The third loop uses structs and regular functions.
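The linked test code is not reproduced in this post, but its shape was roughly the following sketch (repeating the base class and struct definitions from above so it stands alone; the sum-* functions are hypothetical stand-ins for the real test bodies):

```lisp
;; Loop 1: classes + generic functions.  Loop 2: classes + regular
;; functions.  Loop 3: structs + regular functions.  All three use
;; accessors rather than SLOT-VALUE.
(defclass base-class ()
  ((slot-one :initform (random 1.0f0) :type single-float
             :accessor class-slot-one)
   (slot-two :initform (random 1.0f0) :type single-float
             :accessor class-slot-two)))

(defstruct (base-struct)
  (slot-one (random 1.0f0) :type single-float)
  (slot-two (random 1.0f0) :type single-float))

(defgeneric sum-generic (obj))
(defmethod sum-generic ((obj base-class))
  (+ (class-slot-one obj) (class-slot-two obj)))

(defun sum-class (obj)
  (+ (class-slot-one obj) (class-slot-two obj)))

(defun sum-struct (obj)
  (+ (base-struct-slot-one obj) (base-struct-slot-two obj)))

(let ((cc (make-instance 'base-class))
      (ss (make-base-struct)))
  (time (loop repeat 10000000 do (sum-generic cc)))  ; classes + dispatch
  (time (loop repeat 10000000 do (sum-class cc)))    ; classes, no dispatch
  (time (loop repeat 10000000 do (sum-struct ss))))  ; structs, no dispatch
```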

Trying to Unconfound Lisp Speeds June 24th, 2009
Patrick Stein

In the original version of my Optimizing Lisp Some More article, I did a bad comparison between SBCL and Clozure. SBCL supports two different ways to declare the arguments to a function. Clozure only supports one of those ways. As such, my declarations didn’t matter at all to Clozure.

I updated that post with new numbers after putting in both types of declarations. Clozure was much closer to SBCL. I then decided to expand the list to include CMUCL, Allegro (Personal), LispWorks (Personal), ECL, and CLISP. I failed to get GCL or ABCL up and running on my Mac, and Scieneer CL isn’t available for the Mac.

As it turns out, the Allegro and LispWorks versions that I have are heap limited. Thus, they spent a great deal of time cleaning up garbage. To try to even the playing field, I reworked the function to take the ret buffer as a third argument so that allocation is no longer inside the timing loop.

(declaim (ftype (function ((simple-array single-float (12))
                           (simple-array single-float (3))
                           (simple-array single-float (3)))
                          (simple-array single-float (3))) mvl*-na))
(defun mvl*-na (matrix vec ret)
  (declare (type (simple-array single-float (12)) matrix)
           (type (simple-array single-float (3)) vec)
           (type (simple-array single-float (3)) ret)
           (optimize (speed 3) (safety 0)))
  (loop for jj fixnum from 0 below 3
     do (let ((offset (* jj 4)))
          (declare (type fixnum offset))
          (setf (aref ret jj)
                (+ (aref matrix (+ offset 3))
                   (loop for ii fixnum from 0 below 3
                      for kk fixnum from offset below (+ offset 3)
                      summing (* (aref vec ii)
                                 (aref matrix kk))
                      of-type single-float)))))
  ret)
(let ((matrixes (make-ring-of-matrixes '(12) 4096))
      (vectors  (make-ring-of-matrixes '(3) 4095))
      (ret      (make-array 3 :element-type 'single-float
                              :initial-element 0.0f0)))
  (time (loop for jj fixnum from 1 to 10000000
           for mm in matrixes
           for vv in vectors
           do (mvl*-na mm vv ret))))

Here are the results in terms of total user time, non-GC time, and bytes allocated:

                           wall      non-GC    alloced
SBCL 1.0.29                0.444s    0.444s    0
CMUCL 19f                  0.567s    0.567s    0
Clozure-64bit 1.3-r11936   1.272s    1.272s    0 (??)
Clozure-32bit 1.3-r11936   5.009s    4.418s    1,200,000,000
Allegro 8.1 (Personal)     6.131s    2.120s    1,440,000,000
LispWorks 5.1 (Personal)   14.054s   ??        3,360,000,480
ECL 9.6.2                  33.009s   ??        18,240,000,256
GNU CLISP 2.47             93.190s   77.356s   2,520,000,000

As you can see, I failed to keep most of the implementations from allocating things (especially ECL). Intriguingly, 32-bit Clozure allocates a bunch where 64-bit Clozure doesn’t seem to. It looks like Allegro would be pretty competitive if it weren’t using all of this extra memory.

I’m not sure why any of them are allocating with this code. Do they allocate loop counters? loop sums? function parameters? I may delve into the assembly of some of them at a later time. But, at this point, I’m just going to focus on those that don’t cons when I’m not looking.

Even More Optimization June 23rd, 2009
Patrick Stein

In the previous article, I showed a chunk of code for multiplying a vector by a matrix. Nikodemus Siivola, on the sbcl-help mailing list, recommended that I change the matrix to a one-dimensional array instead of a two-dimensional array.

Doing that takes me from ten million iterations in 1.038 seconds to ten million iterations in 0.645 seconds. That brings me an overall improvement of 14x from the original, untyped code.

Thanks, all!

For completeness, here is the final code:

(declaim (ftype (function ((simple-array single-float (12))
                           (simple-array single-float (3)))
                          (simple-array single-float (3))) mvl*))
(defun mvl* (matrix vec)
  (declare (optimize (speed 3) (safety 0)))
  (let ((ret (make-array 3 :initial-element 0.0f0
                           :element-type 'single-float)))
    (loop for jj from 0 below 3
       do (let ((offset (* jj 4)))
            (setf (aref ret jj)
                  (+ (aref matrix (+ offset 3))
                     (loop for ii from 0 below 3
                        for kk from offset below (+ offset 3)
                        summing (* (aref vec ii)
                                   (aref matrix kk))
                         of-type single-float)))))
    ret))

To test this code, I made a ring of 4096 random matrices and a ring of 4095 random vectors. Then, I cycled through each ring as I iterated. (The ring lengths are coprime, so no matrix/vector pairing repeats until 4095 × 4096 ≈ 16.8 million iterations.)

(let ((matrixes (make-ring-of-matrixes '(12) 4096))
      (vectors  (make-ring-of-matrixes '(3) 4095)))
  (time (loop for jj from 1 to 10000000
           for mm in matrixes
           for vv in vectors
           do (mvl* mm vv))))

Making a random matrix is simple enough.

(defun random-matrix (dims)
  (let ((ret (make-array dims :element-type 'single-float)))
    (loop for jj from 0 below (array-total-size ret)
         do (setf (row-major-aref ret jj) (random 1.0f0)))
    ret))

Making a bunch of them into a ring takes slightly more effort (in the form of tail recursion), but not much.

(defun make-ring-of-matrixes (dims count &optional ring first)
  (cond
    ((zerop count) (setf (cdr first) ring))
    (t (let ((mm (cons (random-matrix dims) ring)))
         (make-ring-of-matrixes dims (1- count) mm (or first mm))))))
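The returned list really is circular: walking count conses forward lands back at the head, which is what lets the timing loop iterate ten million times over a 4096-element ring. A quick check (repeating the two definitions above so the snippet stands alone):

```lisp
(defun random-matrix (dims)
  (let ((ret (make-array dims :element-type 'single-float)))
    (loop for jj from 0 below (array-total-size ret)
       do (setf (row-major-aref ret jj) (random 1.0f0)))
    ret))

(defun make-ring-of-matrixes (dims count &optional ring first)
  (cond
    ((zerop count) (setf (cdr first) ring))
    (t (let ((mm (cons (random-matrix dims) ring)))
         (make-ring-of-matrixes dims (1- count) mm (or first mm))))))

;; A ring of four 3-vectors: four steps around brings us back to the
;; head, and every element is a (SIMPLE-ARRAY SINGLE-FLOAT (3)).
(let ((ring (make-ring-of-matrixes '(3) 4)))
  (assert (eq ring (nthcdr 4 ring)))
  (assert (typep (first ring) '(simple-array single-float (3)))))
```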

Optimizing Lisp Some More June 22nd, 2009
Patrick Stein

Today, I spent a large chunk of the day trying to optimize a very small chunk of code. The code takes a matrix with three rows and four columns. It multiplies it by a vector that has three rows (and a pretend fourth row that is set to one). Here is the code that I started from:

(defun mv*-nodecl (matrix vec)
  (let ((ret (make-array 3 :initial-element 0.0f0
                           :element-type 'single-float)))
    (loop for jj from 0 below 3
       do (setf (aref ret jj)
                (+ (aref matrix jj 3)
                   (loop for ii from 0 below 3
                      summing (* (aref vec ii)
                                 (aref matrix jj ii))))))
    ret))

I can do four million iterations of this function in 4.113 seconds. As such, it doesn’t seem like it’s necessarily the highest priority to optimize. But, my Lisp optimizing fu is not so great. This, being a small function, seemed like a good case study.

Compiler optimization without any type declarations

First, we’ll just see how it does if we ask SBCL to optimize for speed and throw caution to the wind.

(defun mv*-nodecl-opt (matrix vec)
  (declare (optimize (speed 3) (safety 0)))
  (let ((ret (make-array 3 :initial-element 0.0f0
                           :element-type 'single-float)))
    (loop for jj from 0 below 3
       do (setf (aref ret jj)
                (+ (aref matrix jj 3)
                   (loop for ii from 0 below 3
                      summing (* (aref vec ii)
                                 (aref matrix jj ii))))))
    ret))

The compiler spits out 38 notes telling me all of the stuff it failed to optimize for me. Then, it runs four million iterations in 3.944 seconds.

Type declarations without compiler optimizations

So, what can we tackle by declaring the types of the variables and the intermediate values?

(declaim (ftype (function ((simple-array single-float (3 4))
                           (simple-array single-float (3)))
                          (simple-array single-float (3))) mv*-noopt))
(defun mv*-noopt (matrix vec)
  (let ((ret (make-array 3 :initial-element 0.0f0
                           :element-type 'single-float)))
    (loop for jj from 0 below 3
       do (setf (aref ret jj)
                (+ (aref matrix jj 3)
                   (loop for ii from 0 below 3
                      summing (* (aref vec ii)
                                 (aref matrix jj ii))
                       of-type single-float))))
    ret))

Now, I can do four million iterations in 0.344 seconds. That’s better by a factor of eleven.

Both compiler optimizations and type declarations

Now, if I turn on compiler optimizations and turn off compiler safety:

(declaim (ftype (function ((simple-array single-float (3 4))
                           (simple-array single-float (3)))
                          (simple-array single-float (3))) mv*))
(defun mv* (matrix vec)
  (declare (optimize (speed 3) (safety 0)))
  (let ((ret (make-array 3 :initial-element 0.0f0
                           :element-type 'single-float)))
    (loop for jj from 0 below 3
       do (setf (aref ret jj)
                (+ (aref matrix jj 3)
                   (loop for ii from 0 below 3
                      summing (* (aref vec ii)
                                 (aref matrix jj ii))
                       of-type single-float))))
    ret))

With that, I am up to 0.394 seconds for four million iterations. I’m not sure why the compiler optimization slows me down slightly. So, I think I am going to nuke the speed request at the moment and call it a day.

Other Lisp Implementations

Justin Grant pointed out in a comment below that Clozure expects the declarations of types to be inside the function instead of externally as I did using the (declaim (ftype …)). So, the numbers in the following paragraph are slightly skewed.

SBCL cleaned Clozure’s clock. The fully typed and optimized version took 6.272 seconds with 32-bit Clozure, while the untyped, unoptimized version took 6.405 seconds. With 64-bit Clozure, the fully typed and optimized version took 2.029 seconds, while the untyped, unoptimized version took 2.129 seconds. It is hard to say how much of that is overhead from the loop counting from one to four million.

CLISP took 158.738 seconds for the fully typed and optimized version. I didn’t bother letting it run the others.

Here are the numbers for the Lisps that I have running on my Mac with ten million iterations:

                           full-decls,   full-decls,   no decls,
                           2D-matrix     1D-matrix     2D-matrix
SBCL 1.0.29                1.040s        0.633s        8.903s
CMUCL 19f                  1.110s        1.960s        25.250s
Clozure-64bit 1.3-r11936   1.775s        1.949s        5.290s
Clozure-32bit 1.3-r11936   8.806s        5.492s        11.455s
Allegro 8.1 (Personal)     8.044s        8.270s        43.220s
LispWorks 5.1 (Personal)   14.092s       14.548s       20.369s
ECL 9.6.2                  40.143s       43.651s       38.765s
GNU CLISP 2.47             365.477s      382.488s      362.511s

Interesting tidbits from the above:

  • For the no-declarations case, Allegro says it allocated -1,549,934,592 other bytes. I am assuming that the counter just wrapped past 2^31 and it really meant 2,745,032,704.
  • LispWorks did about 40% better when I selected Compile and Load from the File menu instead of typing (load (compile-file "tm.lisp")) at the REPL.
  • I tried, but failed, to get GCL and ABCL installed. I may have to try those again some other time.
  • Except for CLISP and ECL, the implementations all allocated significantly more memory for the no-decls case. SBCL allocated ten times as much in the no-decls case. CLISP allocated the same in all cases. ECL allocated 20% less in the no-decls case.
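The wraparound guess in the first bullet is easy to check: reinterpreting the negative count as an unsigned 32-bit value (equivalently, adding 2^32) recovers the presumed true total.

```lisp
;; -1,549,934,592 read back as an unsigned 32-bit counter:
(assert (= (ldb (byte 32 0) -1549934592) 2745032704))
(assert (= (+ -1549934592 (expt 2 32)) 2745032704))
```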

Important Lessons

I ran into all sorts of compiler notes, warnings, and errors while trying to optimize this code. I was trying to use (SVREF …) to reference into an array. SVREF means simple-vector reference, and I am not allowed to use it on my specialized arrays. I hadn’t realized before that the elements of a simple-vector are required to be of the unspecialized type T. For a vector or a simple-array, you can specialize the type of the elements; for a simple-vector, you cannot. Richard Kreuter, on the sbcl-help mailing list, set me straight.
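A small example of the distinction (this is standard Common Lisp behavior, nothing project-specific):

```lisp
;; A SIMPLE-VECTOR always has element type T; a specialized
;; (SIMPLE-ARRAY SINGLE-FLOAT (*)) is not a SIMPLE-VECTOR, so SVREF
;; applies only to the former, while AREF works on both.
(let ((sv (make-array 3 :initial-element 0))
      (sa (make-array 3 :element-type 'single-float
                        :initial-element 0.0f0)))
  (assert (typep sv 'simple-vector))
  (assert (not (typep sa 'simple-vector)))
  (setf (svref sv 0) 42)          ; legal: SV is a simple-vector
  (setf (aref sa 0) 42.0f0)       ; AREF works on any array
  (assert (= (svref sv 0) 42))
  (assert (= (aref sa 0) 42.0f0)))
```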

My other big stumbling block was this. The following two lines mean exactly the same thing (they both create an array of three zeros):

(make-array 3 :initial-element 0.0f0 :element-type 'single-float)
(make-array '(3) :initial-element 0.0f0 :element-type 'single-float)

If I wanted a multidimensional array, I would need to use the second form there and put a list of numbers in where the three is. For a one-dimensional array, I can use either the length as a number or a list with the length as its only element.
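For example, all three of the following are legal, and only the list form extends to more dimensions:

```lisp
;; One-dimensional: a bare length and a one-element list are equivalent.
(let ((aa (make-array 3 :initial-element 0.0f0
                        :element-type 'single-float))
      (bb (make-array '(3) :initial-element 0.0f0
                           :element-type 'single-float))
      ;; Multidimensional: the dimensions must be given as a list.
      (cc (make-array '(3 4) :initial-element 0.0f0
                             :element-type 'single-float)))
  (assert (equal (array-dimensions aa) '(3)))
  (assert (equal (array-dimensions bb) '(3)))
  (assert (equal (array-dimensions cc) '(3 4))))
```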

On the other hand, the following two lines mean very different things:

(declare (type (simple-array single-float 3) foo))
(declare (type (simple-array single-float (3)) foo))

The former means that foo is a three-dimensional array of single-precision floating point numbers with the length of each array dimension left unspecified. The latter means that foo is a one-dimensional array with three single-precision floating point elements.
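The difference is observable with (typep …):

```lisp
;; A bare number in the dimensions position of the type specifier is a
;; RANK; a list gives the dimensions themselves.
(let ((rank-3 (make-array '(2 2 2) :element-type 'single-float
                                   :initial-element 0.0f0))
      (len-3  (make-array 3 :element-type 'single-float
                            :initial-element 0.0f0)))
  (assert (typep rank-3 '(simple-array single-float 3)))   ; rank 3
  (assert (not (typep rank-3 '(simple-array single-float (3)))))
  (assert (typep len-3 '(simple-array single-float (3))))  ; 3 elements
  (assert (not (typep len-3 '(simple-array single-float 3)))))
```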
