In response to my earlier post about hidden memory allocations, Jason Cornez pointed out how to rework my loop so that Allegro wouldn’t need to allocate memory. The new structure avoids using the summing keyword in (loop …) and doesn’t use the return-value from (loop …).
(declaim (ftype (function ((simple-array single-float (12))
                           (simple-array single-float (3))
                           (simple-array single-float (3)))
                          (simple-array single-float (3))) mvl*-na))
(defun mvl*-na (matrix vec ret)
  (declare (type (simple-array single-float (12)) matrix)
           (type (simple-array single-float (3)) vec)
           (type (simple-array single-float (3)) ret)
           (optimize (speed 3) (safety 0)))
  ;; MATRIX holds 3 rows of 4 floats; the fourth entry of each row
  ;; seeds the accumulator for that row.
  (loop for jj fixnum from 0 below 3
        do (let ((offset (* jj 4)))
             (declare (type fixnum offset))
             (let ((current (aref matrix (+ offset 3))))
               (declare (type single-float current))
               (loop for ii fixnum from 0 below 3
                     for kk fixnum from offset below (+ offset 3)
                     do (incf current
                              (* (aref vec ii)
                                 (aref matrix kk))))
               ;; Store into the caller-supplied RET instead of
               ;; returning a value through LOOP.
               (setf (aref ret jj) current))))
  ret)
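To make the difference between the two loop styles concrete, here is a minimal sketch using a plain three-element dot product. The names dot3-summing and dot3-into are hypothetical and are not from the original post. The first hands its result back through the summing clause of (loop …), which on some implementations may force the float to be boxed. The second accumulates in a declared local and stores it into a preallocated array, as mvl*-na does.

```lisp
;; Hypothetical helpers for illustration only.

(defun dot3-summing (a b)
  "Dot product whose result is the return value of LOOP's SUMMING clause."
  (declare (type (simple-array single-float (3)) a b))
  (loop for i fixnum from 0 below 3
        summing (* (aref a i) (aref b i))))

(defun dot3-into (a b out index)
  "Dot product accumulated in a local and stored in OUT at INDEX."
  (declare (type (simple-array single-float (3)) a b out)
           (type fixnum index))
  (let ((acc 0.0f0))
    (declare (type single-float acc))
    ;; The inner LOOP is used only for its side effect on ACC.
    (loop for i fixnum from 0 below 3
          do (incf acc (* (aref a i) (aref b i))))
    (setf (aref out index) acc))
  out)
```

Both compute the same number; whether the first one actually allocates depends on the compiler, so it is worth checking with your implementation's (time …) output.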
With this version, Allegro moved up in the rankings. All of the implementations (except 32-bit Clozure) ran slightly faster. Allegro improved the most, though, as its allocation and garbage collection dropped to nothing.
Implementation | wall | non-GC | alloced (bytes) |
SBCL 1.0.29 | 0.406s | 0.406s | 0 |
CMUCL 19f | 0.539s | 0.539s | 32 |
Allegro 8.1 (Personal) | 0.720s | 0.720s | 0 |
Clozure-64bit 1.3-r11936 | 1.100s | 1.100s | ?? 0 |
Clozure-32bit 1.3-r11936 | 6.884s | 5.023s | 1,680,000,000 |
LispWorks 5.1 (Personal) | 12.587s | ?? | 3,120,000,516 |
ECL 9.6.2 | 30.875s | ?? | 17,280,000,256 |
GNU CLISP 2.47 | 88.008s | 74.455s | 2,160,000,000 |
CMUCL compiled with 30 Notes. I should look into those some time. It still ran quite fast though.
Allegro definitely liked this version better: it no longer allocated anything.
For comparison, I am including the previous runtimes here as well:
Implementation | wall | non-GC | alloced (bytes) |
SBCL 1.0.29 | 0.444s | 0.444s | 0 |
CMUCL 19f | 0.567s | 0.567s | 0 |
Clozure-64bit 1.3-r11936 | 1.272s | 1.272s | ?? 0 |
Clozure-32bit 1.3-r11936 | 5.009s | 4.418s | 1,200,000,000 |
Allegro 8.1 (Personal) | 6.131s | 2.120s | 1,440,000,000 |
LispWorks 5.1 (Personal) | 14.054s | ?? | 3,360,000,480 |
ECL 9.6.2 | 33.009s | ?? | 18,240,000,256 |
GNU CLISP 2.47 | 93.190s | 77.356s | 2,520,000,000 |
As malkia noted in another thread, LispWorks has a few additional flags that keep it from wrapping single-precision floats so much. With the addition of (float 0) (safety 0), the LispWorks time drops to 0.562 seconds with only 516 bytes allocated.
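As a rough sketch of what adding those flags might look like, the function below puts (float 0) next to (safety 0) in an optimize declaration. Where exactly the flags were added in the benchmark is not stated, so the placement here is an assumption, and sum3 is a hypothetical function used only for illustration. The #+lispworks read-time conditional keeps the LispWorks-specific quality away from other compilers.

```lisp
;; Hypothetical example; not the benchmarked function.
(defun sum3 (vec)
  "Sum the three elements of VEC."
  (declare (type (simple-array single-float (3)) vec)
           ;; (float 0) is only read on LispWorks; other
           ;; implementations see just SPEED and SAFETY.
           (optimize (speed 3) (safety 0) #+lispworks (float 0)))
  (let ((acc 0.0f0))
    (declare (type single-float acc))
    (loop for i fixnum from 0 below 3
          do (incf acc (aref vec i)))
    acc))
```

With (safety 0) the compiler trusts the type declarations, so passing anything other than a three-element single-float array is undefined behaviour.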
For comparison, I have created a simple D (V1) version. It uses a matrix and an array, and it is not the fastest possible version.
You can compile it yourself with LDC (codepad runs it, but without any optimization flags):
http://codepad.org/SxUHubIn
Email me if you have questions.