Today, I spent a large chunk of the day trying optimize a very small chunk of code. The code takes a matrix with three rows and four columns. It multiplies it by a vector that has three rows (and a pretend fourth row that is set to one). Here is the code that I started from:
(let ((ret (make-array 3 :initial-element 0.0f0
:element-type 'single-float)))
(loop for jj from 0 below 3
do (setf (aref ret jj)
(+ (aref matrix jj 3)
(loop for ii from 0 below 3
summing (* (aref vec ii)
(aref matrix jj ii))))))
ret))
I can do four million iterations of this function in 4.113 seconds. As such, it doesn’t seem like it’s necessarily the highest priority to optimize. But, my Lisp optimizing fu is not so great. This, being a small function, seemed like a good case study.
Compiler optimization without any type declarations
First, we’ll just see how it does if we ask SBCL to optimize for speed and throw caution to the wind.
(declare (optimize (speed 3) (safety 0))
(let ((ret (make-array 3 :initial-element 0.0f0
:element-type 'single-float)))
(loop for jj from 0 below 3
do (setf (aref ret jj)
(+ (aref matrix jj 3)
(loop for ii from 0 below 3
summing (* (aref vec ii)
(aref matrix jj ii))))))
ret))
The compiler spits out 38 notes telling me all of the stuff it failed to optimize for me. Then, it runs four million iterations in 3.944 seconds.
Type declarations without compiler optimizations
So, what can we tackle by declaring the types of the variables and the intermediate values?
(simple-array single-float (3)))
(simple-array single-float (3))) mv*-noopt))
(defun mv*-noopt (matrix vec)
(let ((ret (make-array 3 :initial-element 0.0f0
:element-type 'single-float)))
(loop for jj from 0 below 3
do (setf (aref ret jj)
(+ (aref matrix jj 3)
(loop for ii from 0 below 3
summing (* (aref vec ii)
(aref matrix jj ii))
single-float))))
ret))
Now, I can do four million iterations in 0.344 seconds. That’s better by a factor of eleven.
Both compiler optimizations and type declarations
Now, if I turn on compiler optimizations and turn off compiler safety:
(simple-array single-float (3)))
(simple-array single-float (3))) mv*))
(defun mv* (matrix vec)
(declare (optimize (speed 3) (safety 0)))
(let ((ret (make-array 3 :initial-element 0.0f0
:element-type 'single-float)))
(loop for jj from 0 below 3
do (setf (aref ret jj)
(+ (aref matrix jj 3)
(loop for ii from 0 below 3
summing (* (aref vec ii)
(aref matrix jj ii))
single-float))))
ret))
With that, I am up to 0.394 seconds for four million iterations. I’m not sure why the compiler optimization slows me down slightly. So, I think I am going to nuke the speed request at the moment and call it a day.
Other Lisp Implementations
Justin Grant pointed out in a comment below that Clozure expects the declarations of types to be inside the function instead of externally as I did using the (declaim (ftype …)). So, the numbers in the following paragraph are slightly skewed.
SBCL cleaned Clozure‘s clock. The fully typed and optimized version took 6.272 seconds with 32-bit Clozure while the untyped, unoptimized version took 6.405 seconds. The fully typed and optimized version took 2.029 seconds with 64-bit Clozure while the untyped, unoptimized version too 2.129 seconds. It is hard to say how much of that is overhead with the one to four million loop.
CLisp took 158.738 seconds for the fully typed and optimized version. I didn’t bother letting it run the others.
Here are the numbers for the Lisps that I have running on my Mac with ten million iterations:
full-decls, 2D-matrix |
full-decls, 1D-matrix |
no decls, 2D-matrix |
|
SBCL 1.0.29 | 1.040s | 0.633s | 8.903s |
CMUCL 19f | 1.110s | 1.960s | 25.250s |
Clozure-64bit 1.3-r11936 | 1.775s | 1.949s | 5.290s |
Clozure-32bit 1.3-r11936 | 8.806s | 5.492s | 11.455s |
Allegro 8.1 (Personal) | 8.044s | 8.270s | 43.220s |
LispWorks 5.1 (Personal) | 14.092s | 14.548s | 20.369s |
ECL 9.6.2 | 40.143s | 43.651s | 38.765s |
GNU CLISP 2.47 | 365.477s | 382.488s | 362.511s |
Interesting tidbits from the above:
- For the no declarations case, Allegro says it allocated
-1,549,934,592 other bytes
. I am assuming that it just crossed over and it really meant 2,745,032,704. - LispWorks did about 40% better when I selected Compile and Load from the File menu instead of typing (load (compile-file “tm.lisp”)) at the REPL.
- I tried, but failed, to get GCL and ABCL installed. I may have to try those again some other time.
- Except for CLISP and ECL, the implementations all allocated significantly more memory for the no-decls case. SBCL allocated ten times as much in the no-decls case. CLISP allocated the same in all cases. ECL allocated 20% less in the no-decls case.
Important Lessons
I ran into all sorts of compiler notes and warnings and errors while trying to optimize this code. I was trying to use (SVREF …) to reference into an array. SVREF means simple-vector reference. I am not allowed to use it on an array. I hadn’t realized before that in a simple-vector, the elements are required to have unspecified type. For a vector or a simple-array, you can specify the type of the elements. For a simple-vector, you cannot. Richard Krueter, on the sbcl-help mailing list, set me straight.
My other big stumbling block was this. The following two lines mean exactly the same thing (they both create an array of three zeros):
(make-array '(3) :initial-element 0.0f0 :element-type 'single-float)
If I wanted a multidimensional array, I would need to use the second form there and put a list of numbers in where the three is. For a one-dimensional array, I can use either the length as a number or a list with the length as its only element.
On the other hand, the following two lines mean very different things:
(declare (type (simple-array single-float (3)) foo))
The former means that foo is a three-dimensional array of single-precision floating point numbers with the length of each array dimension left unspecified. The latter means that foo is a one-dimensional array with three single-precision floating point elements.
[…] the previous article, I showed a chunk of code for multiplying a vector by a matrix. Nikodemus Silvola, on the sbcl-help […]
The code has not been optimized corrcetly for Clozure. Typing should be like this instead to compare SBCL and CLozure :
(declare (type (simple-array single-float (3 4)) matrix))
(declare (type (simple-array single-float (3)) vec))
(declare (optimize (speed 3) (safety 0)))
(let ((ret (make-array 3 :initial-element 0.0f0
:element-type 'single-float)))
(declare (type (simple-array single-float (3)) ret))
(loop for jj from 0 below 3
do (setf (aref ret jj)
(+ (aref matrix jj 3)
(loop for ii from 0 below 3
summing (* (aref vec ii)
(aref matrix jj ii))
single-float))))
ret))
Fair enough. I included the other declarations. I will update the timing numbers in the original post.
The numbers for Allegro and Lispworks seem almost unbelievably bad in comparison to SBCL, CMUCL and Clozure. I’m wondering if the type decls required for those are also different (they should not be).
Also, are these limited versions that you’re using (with heap limits etc.) ?
If they are then the numbers would be invalid for comparison.
I agree. I was surprised by the difference. I am using the free LW and a trial Allegro. They don’t advertise heap limits, IIRC.
My next steps are to type the loop counter variables and to try a version where ‘ret’ is allocated outside the timing form. Then, if there is still a big difference, I will delve into the assembly.
From the LW personal edition limitations :
http://www.lispworks.com/downloads/
“There is a heap size limit which, if exceeded, causes the image to exit. A warning is provided when the limit is approached. ”
Allegro CL Free Express edition :
http://www.franz.com/downloads/
“Limited heap size (60MB) ”
Would be interesting to see the code run on the full versions.
Aha. My bad. I will rerun with the memory allocated outside the timer.
I didn’t get a heap warning, but I probably spent a whole lot of time garbage-collecting.
I tried moving the allocation out of that function. There is still something that is making Allegro and LispWorks allocate memory inside that function or inside the loop. I haven’t looked at the assembly yet to try to figure out what they’re doing that’s causing them to chew through memory.
As you can see in the table in this article, several implementations allocated zero bytes for the same code that caused other implementations to allocate anywhere from 120 bytes to 1,842 bytes per iteration.
I do remember seeing the LispWorks thing about heap size, now that you mention it.
I’ve got to say though… it’s not a very promising way to get me to buy their product if the free edition is artificially slower. If their products turned out to be speedy enough to compete with the free implementations, I’d be much more inclined to shell out the money for them. As it is, I can’t really tell how competitive they are.
LispWorks 5-hour-per-session, no (save-image…), and non-commercial restrictions seem fine to me. The heap size restriction? that’s just running with a ball and chain around each ankle.
[…] the original version of my Optimizing Lisp Some More article, I did a bad comparison between SBCL and Clozure. SBCL supports two different ways to […]
For Lispworks you should make sure that you have this:
(debug 0) (compilation-speed 0)
#+lispworks (float 0)
#+lispworks (fixnum-safety 0)))
Indeed. That brought LispWorks down to 0.562 seconds and only 516 bytes allocated.
Thank you, very much.