In the original version of my Optimizing Lisp Some More article, I did a bad comparison between SBCL and Clozure. SBCL supports two different ways to declare the arguments to a function. Clozure only supports one of those ways. As such, my declarations didn’t matter at all to Clozure.
I updated that post with new numbers after putting in both types of declarations. Clozure was much closer to SBCL. I then decided to expand the list to include CMUCL, Allegro (Personal), LispWorks (Personal), ECL, and CLISP. I failed to get GCL or ABCL up and running on my Mac, and Scieneer CL isn’t available for the Mac.
As it turns out, the Allegro and LispWorks versions that I have are heap-limited. Thus, they spend a great deal of time cleaning up garbage. To try to level the playing field, I reworked the function to take the ret buffer as a third argument so that the allocation is no longer inside the timing loop.
(declaim (ftype (function ((simple-array single-float (12))
                           (simple-array single-float (3))
                           (simple-array single-float (3)))
                          (simple-array single-float (3))) mvl*-na))
(defun mvl*-na (matrix vec ret)
  (declare (type (simple-array single-float (12)) matrix)
           (type (simple-array single-float (3)) vec)
           (type (simple-array single-float (3)) ret)
           (optimize (speed 3) (safety 0)))
  (loop for jj fixnum from 0 below 3
        do (let ((offset (* jj 4)))
             (declare (type fixnum offset))
             (setf (aref ret jj)
                   (+ (aref matrix (+ offset 3))
                      (loop for ii fixnum from 0 below 3
                            for kk fixnum from offset below (+ offset 3)
                            summing (* (aref vec ii)
                                       (aref matrix kk))
                            of-type single-float)))))
  ret)
(let* ((matrixes (make-ring-of-matrixes '(12) 4095))
       (vectors (make-ring-of-matrixes '(3) 4095))
       (ret (make-array 3 :element-type 'single-float
                          :initial-element 0.0f0)))
  (time (loop for jj fixnum from 1 to 10000000
              for mm in matrixes
              for vv in vectors
              do (mvl*-na mm vv ret))))
Here are the results in terms of total user time, non-GC time, and bytes allocated:
implementation           |    wall |  non-GC |        alloced |
SBCL 1.0.29              |  0.444s |  0.444s |              0 |
CMUCL 19f                |  0.567s |  0.567s |              0 |
Clozure-64bit 1.3-r11936 |  1.272s |  1.272s |           ?? 0 |
Clozure-32bit 1.3-r11936 |  5.009s |  4.418s |  1,200,000,000 |
Allegro 8.1 (Personal)   |  6.131s |  2.120s |  1,440,000,000 |
LispWorks 5.1 (Personal) | 14.054s |      ?? |  3,360,000,480 |
ECL 9.6.2                | 33.009s |      ?? | 18,240,000,256 |
GNU CLISP 2.47           | 93.190s | 77.356s |  2,520,000,000 |
As you can see, I failed to keep most of the implementations from allocating (especially ECL). Intriguingly, the 32-bit Clozure allocates a great deal where the 64-bit Clozure doesn’t seem to do so. It looks like Allegro would be pretty competitive if it weren’t using all of this extra memory.
I’m not sure why any of them are allocating with this code. Do they allocate loop counters? loop sums? function parameters? I may delve into the assembly of some of them at a later time. But, at this point, I’m just going to focus on those that don’t cons when I’m not looking.
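For anyone who wants to poke at this before I do, here is a minimal sketch of how one might check a single call. The buffers below are throwaway stand-ins, and whether TIME reports per-call consing (and what the disassembly looks like) varies by implementation:

;; TIME shows bytes consed for one call (in implementations that report
;; it), and DISASSEMBLE shows whether AREF and the float arithmetic were
;; inlined or dispatched to generic (possibly consing) routines.
(let ((mm  (make-array 12 :element-type 'single-float :initial-element 1.0f0))
      (vv  (make-array 3 :element-type 'single-float :initial-element 1.0f0))
      (ret (make-array 3 :element-type 'single-float :initial-element 0.0f0)))
  (time (mvl*-na mm vv ret)))

(disassemble #'mvl*-na)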
The consing likely comes from allocating boxed single-floats in the generic AREF routines and/or the arithmetic routines. SBCL and CMUCL know how to inline AREF on single-float arrays and the arithmetic routines, so they don’t have to cons. Allegro and LispWorks should be able to do that too; LispWorks might require (FLOAT 0) or similar. I should think Clozure CL can do that too, although the 32-bit version obviously doesn’t. I do know that Clozure has different representations for single-floats in 32-bit vs. 64-bit (the 64-bit representation is immediate and doesn’t require allocating any extra memory), so maybe it’s not inlining but just relying on the generic version, which happens not to cons in the 64-bit implementation.
It does seem likely that it is boxing and unboxing floats that is causing the allocations. I would have hoped some of that could be done on the stack instead of in the heap, but….
If I get some time in the near future, I may explore the respective documentation to see how one is “supposed to” do such things with minimal memory thrashing.
Thanks….
In Allegro 8.1, try the following formulation instead. It avoids the boxing and results in no extra memory allocation. Hence it is quite a bit faster. Unless I’ve made a silly mistake, it should compute the same result…
-Jason
(defun mvl*-na (matrix vec ret)
  (declare (type (simple-array single-float (12)) matrix)
           (type (simple-array single-float (3)) vec)
           (type (simple-array single-float (3)) ret)
           (optimize (speed 3) (safety 0)))
  (loop for jj fixnum from 0 below 3
        do (let ((offset (* jj 4)))
             (declare (type fixnum offset))
             (setf (aref ret jj) (aref matrix (+ offset 3)))
             (loop for ii fixnum from 0 below 3
                   for kk fixnum from offset below (+ offset 3)
                   do (incf (aref ret jj) (* (aref vec ii)
                                             (aref matrix kk))))))
  ret)
Indeed, that is a great deal faster, and without allocating. Thank you. I think I will write the numbers up in a separate article. Reworking it with (incf …) instead of the summing loop (and using a local variable to avoid doing (aref ret jj) multiple times) resulted in 0.40 seconds for SBCL and 0.72 seconds for Allegro.
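For reference, here is a sketch of what that reworked inner loop of mvl*-na might look like; the acc variable name is my own placeholder:

(loop for jj fixnum from 0 below 3
      do (let* ((offset (* jj 4))
                (acc (aref matrix (+ offset 3))))
           (declare (type fixnum offset)
                    (type single-float acc))
           ;; accumulate the dot product in a local, then store it once
           (loop for ii fixnum from 0 below 3
                 for kk fixnum from offset below (+ offset 3)
                 do (incf acc (* (aref vec ii)
                                 (aref matrix kk))))
           (setf (aref ret jj) acc)))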
You could use symbol-macrolet instead of using “a local variable to avoid doing (aref ret jj) multiple times”.
Regards, Faheem
Do you have reason to believe that symbol-macrolet would result in something the compiler felt safe optimizing in a way it wouldn’t if I had manually written the form multiple times?
I misunderstood your point. You used a local variable for (aref ret jj) to avoid recalculating it, so as to improve performance; symbol-macrolet would be exactly equivalent, in terms of code behavior, to writing the form out multiple times.
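For illustration, here is roughly what the symbol-macrolet spelling being discussed would look like inside mvl*-na (a sketch only, with cell as a placeholder name). It macroexpands into the same repeated (aref ret jj) forms, so it should compile the same as writing the form out each time:

(loop for jj fixnum from 0 below 3
      do (symbol-macrolet ((cell (aref ret jj)))
           (let ((offset (* jj 4)))
             (declare (type fixnum offset))
             ;; cell reads more clearly but expands to (aref ret jj)
             (setf cell (aref matrix (+ offset 3)))
             (loop for ii fixnum from 0 below 3
                   for kk fixnum from offset below (+ offset 3)
                   do (incf cell (* (aref vec ii)
                                    (aref matrix kk)))))))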