Trying to Unconfound Lisp Speeds June 24th, 2009
Patrick Stein

In the original version of my Optimizing Lisp Some More article, I did a bad comparison between SBCL and Clozure. SBCL supports two different ways to declare the arguments to a function. Clozure only supports one of those ways. As such, my declarations didn’t matter at all to Clozure.

I updated that post with new numbers after putting in both types of declarations. Clozure was much closer to SBCL. I then decided to expand the list to include CMUCL, Allegro (Personal), Lispworks (Personal), ECL, and CLISP. I failed to get GCL or ABCL up and running on my Mac, and Scieneer CL isn’t available for the Mac.

As it turns out, the Allegro and LispWorks versions that I have are heap limited. Thus, they spent a great deal of time cleaning up garbage. To try to even the playing field, I reworked the function to take the ret buffer as a third argument so that allocation is no longer inside the timing loop.

(declaim (ftype (function ((simple-array single-float (12))
                           (simple-array single-float (3))
                           (simple-array single-float (3)))
                          (simple-array single-float (3))) mvl*-na))
(defun mvl*-na (matrix vec ret)
  (declare (type (simple-array single-float (12)) matrix)
           (type (simple-array single-float (3)) vec)
           (type (simple-array single-float (3)) ret)
           (optimize (speed 3) (safety 0)))
  (loop for jj fixnum from 0 below 3
     do (let ((offset (* jj 4)))
          (declare (type fixnum offset))
          (setf (aref ret jj)
                (+ (aref matrix (+ offset 3))
                   (loop for ii fixnum from 0 below 3
                      for kk fixnum from offset below (+ offset 3)
                      summing (* (aref vec ii)
                                 (aref matrix kk))
                      of-type single-float)))))
(let ((matrixes (make-ring-of-matrixes '(12) 4096))
      (vectors  (make-ring-of-matrixes '(3) 4095))
      (ret      (make-array 3 :element-type 'single-float
                              :initial-element 0.0f0)))
  (time (loop for jj fixnum from 1 to 10000000
           for mm in matrixes
           for vv in vectors
           do (mvl*-na mm vv ret))))

Here are the results in terms of total user time, non-GC time, and bytes allocated:

  wall non-GC alloced
SBCL 1.0.29 0.444s 0.444s 0
CMUCL 19f 0.567s 0.567s 0
Clozure-64bit 1.3-r11936 1.272s 1.272s ?? 0
Clozure-32bit 1.3-r11936 5.009s 4.418s 1,200,000,000
Allegro 8.1 (Personal) 6.131s 2.120s 1,440,000,000
LispWorks 5.1 (Personal) 14.054s ?? 3,360,000,480
ECL 9.6.2 33.009s ?? 18,240,000,256
GNU CLISP 2.47 93.190s 77.356s 2,520,000,000

As you can see, I failed to keep most of the implementations from allocating things (especially, ECL). Intriguingly, the 32-bit Clozure allocates a bunch where the 64-bit Clozure doesn’t seem to do so. It looks like Allegro would be pretty competitive if it weren’t using all of this extra memory.

I’m not sure why any of them are allocating with this code. Do they allocate loop counters? loop sums? function parameters? I may delve into the assembly of some of them at a later time. But, at this point, I’m just going to focus on those that don’t cons when I’m not looking.