Today, I spent a large chunk of the day trying optimize a very small chunk of code. The code takes a matrix with three rows and four columns. It multiplies it by a vector that has three rows (and a pretend fourth row that is set to one). Here is the code that I started from:
(defun mv*-nodecl (matrix vec)
(let ((ret (make-array 3 :initial-element 0.0f0
:element-type 'single-float)))
(loop for jj from 0 below 3
do (setf (aref ret jj)
(+ (aref matrix jj 3)
(loop for ii from 0 below 3
summing (* (aref vec ii)
(aref matrix jj ii))))))
ret))
I can do four million iterations of this function in 4.113 seconds. As such, it doesn’t seem like it’s necessarily the highest priority to optimize. But, my Lisp optimizing fu is not so great. This, being a small function, seemed like a good case study.
Compiler optimization without any type declarations
First, we’ll just see how it does if we ask SBCL to optimize for speed and throw caution to the wind.
(defun mv*-nodecl-opt (matrix vec)
(declare (optimize (speed 3) (safety 0))
(let ((ret (make-array 3 :initial-element 0.0f0
:element-type 'single-float)))
(loop for jj from 0 below 3
do (setf (aref ret jj)
(+ (aref matrix jj 3)
(loop for ii from 0 below 3
summing (* (aref vec ii)
(aref matrix jj ii))))))
ret))
The compiler spits out 38 notes telling me all of the stuff it failed to optimize for me. Then, it runs four million iterations in 3.944 seconds.
Type declarations without compiler optimizations
So, what can we tackle by declaring the types of the variables and the intermediate values?
(declaim (ftype (function ((simple-array single-float (3 4))
(simple-array single-float (3)))
(simple-array single-float (3))) mv*-noopt))
(defun mv*-noopt (matrix vec)
(let ((ret (make-array 3 :initial-element 0.0f0
:element-type 'single-float)))
(loop for jj from 0 below 3
do (setf (aref ret jj)
(+ (aref matrix jj 3)
(loop for ii from 0 below 3
summing (* (aref vec ii)
(aref matrix jj ii))
single-float))))
ret))
Now, I can do four million iterations in 0.344 seconds. That’s better by a factor of eleven.
Both compiler optimizations and type declarations
Now, if I turn on compiler optimizations and turn off compiler safety:
(declaim (ftype (function ((simple-array single-float (3 4))
(simple-array single-float (3)))
(simple-array single-float (3))) mv*))
(defun mv* (matrix vec)
(declare (optimize (speed 3) (safety 0)))
(let ((ret (make-array 3 :initial-element 0.0f0
:element-type 'single-float)))
(loop for jj from 0 below 3
do (setf (aref ret jj)
(+ (aref matrix jj 3)
(loop for ii from 0 below 3
summing (* (aref vec ii)
(aref matrix jj ii))
single-float))))
ret))
With that, I am up to 0.394 seconds for four million iterations. I’m not sure why the compiler optimization slows me down slightly. So, I think I am going to nuke the speed request at the moment and call it a day.
Other Lisp Implementations
Justin Grant pointed out in a comment below that Clozure expects the declarations of types to be inside the function instead of externally as I did using the (declaim (ftype …)). So, the numbers in the following paragraph are slightly skewed.
SBCL cleaned Clozure‘s clock. The fully typed and optimized version took 6.272 seconds with 32-bit Clozure while the untyped, unoptimized version took 6.405 seconds. The fully typed and optimized version took 2.029 seconds with 64-bit Clozure while the untyped, unoptimized version too 2.129 seconds. It is hard to say how much of that is overhead with the one to four million loop.
CLisp took 158.738 seconds for the fully typed and optimized version. I didn’t bother letting it run the others.
Here are the numbers for the Lisps that I have running on my Mac with ten million iterations:
|
full-decls, 2D-matrix |
full-decls, 1D-matrix |
no decls, 2D-matrix |
SBCL 1.0.29 |
1.040s |
0.633s |
8.903s |
CMUCL 19f |
1.110s |
1.960s |
25.250s |
Clozure-64bit 1.3-r11936 |
1.775s |
1.949s |
5.290s |
Clozure-32bit 1.3-r11936 |
8.806s |
5.492s |
11.455s |
Allegro 8.1 (Personal) |
8.044s |
8.270s |
43.220s |
LispWorks 5.1 (Personal) |
14.092s |
14.548s |
20.369s |
ECL 9.6.2 |
40.143s |
43.651s |
38.765s |
GNU CLISP 2.47 |
365.477s |
382.488s |
362.511s |
Interesting tidbits from the above:
- For the no declarations case, Allegro says it allocated
-1,549,934,592 other bytes
. I am assuming that it just crossed over and it really meant 2,745,032,704.
- LispWorks did about 40% better when I selected Compile and Load from the File menu instead of typing (load (compile-file “tm.lisp”)) at the REPL.
- I tried, but failed, to get GCL and ABCL installed. I may have to try those again some other time.
- Except for CLISP and ECL, the implementations all allocated significantly more memory for the no-decls case. SBCL allocated ten times as much in the no-decls case. CLISP allocated the same in all cases. ECL allocated 20% less in the no-decls case.
Important Lessons
I ran into all sorts of compiler notes and warnings and errors while trying to optimize this code. I was trying to use (SVREF …) to reference into an array. SVREF means simple-vector reference. I am not allowed to use it on an array. I hadn’t realized before that in a simple-vector, the elements are required to have unspecified type. For a vector or a simple-array, you can specify the type of the elements. For a simple-vector, you cannot. Richard Krueter, on the sbcl-help mailing list, set me straight.
My other big stumbling block was this. The following two lines mean exactly the same thing (they both create an array of three zeros):
(make-array 3 :initial-element 0.0f0 :element-type 'single-float)
(make-array '(3) :initial-element 0.0f0 :element-type 'single-float)
If I wanted a multidimensional array, I would need to use the second form there and put a list of numbers in where the three is. For a one-dimensional array, I can use either the length as a number or a list with the length as its only element.
On the other hand, the following two lines mean very different things:
(declare (type (simple-array single-float 3) foo))
(declare (type (simple-array single-float (3)) foo))
The former means that foo is a three-dimensional array of single-precision floating point numbers with the length of each array dimension left unspecified. The latter means that foo is a one-dimensional array with three single-precision floating point elements.