Dr. David McClain of Refined Audiometrics sent me his code for interfacing with the Apple vDSP library. Here it is, along with a Makefile that I put together for it: vdsp-wrapper.tar.gz. I have not actually run the code myself. I started to convert his LispWorks FLI code to CFFI, but bailed on it in favor of making a minimal CFFI interface to FFTW.
Dr. McClain says that Apple’s vDSP library usually gains him 20% or so versus FFTW. Further, there may well be alignment issues that should be taken into account when allocating the foreign buffers. Regardless, here is my very basic wrapper around FFTW’s fftw_plan_dft_1d() function: fftw.lisp.
To get this working with SBCL on my Mac, I needed to get FFTW compiling 32-bit shared libraries. By default, it compiles 64-bit static libraries on my system.
% CC="gcc -arch i386" ./configure --prefix=/usr/local --enable-shared
Before I recompiled for 32-bit, though, I manually renamed the libfftw3.* files in /usr/local/lib to libfftw3_64.* (renaming both the symlink libfftw3.dylib and its target, of course).
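The renaming step can be sketched as a small shell function. This is only an illustration, not the exact commands I ran: the function name rename_fftw_libs is made up for this example, and it assumes any symlinks in the directory point at sibling files by bare name (as libfftw3.dylib normally does), so they can be re-pointed at the renamed target.

```shell
# Sketch: rename every libfftw3.* in a directory to libfftw3_64.*.
# rename_fftw_libs is a hypothetical helper name, not part of FFTW.
rename_fftw_libs() {
    libdir=$1
    for path in "$libdir"/libfftw3.*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$path" ] || [ -L "$path" ] || continue
        name=${path##*/}                        # e.g. libfftw3.dylib
        new="libfftw3_64.${name#libfftw3.}"     # e.g. libfftw3_64.dylib
        if [ -L "$path" ]; then
            # Recreate the symlink so it points at the renamed target.
            target=$(readlink "$path")
            case $target in
                libfftw3.*) target="libfftw3_64.${target#libfftw3.}" ;;
            esac
            ln -s "$target" "$libdir/$new"
            rm "$path"
        else
            mv "$path" "$libdir/$new"
        fi
    done
}
```

Calling `rename_fftw_libs /usr/local/lib` (with sudo if the directory needs it) would do the move in place. Trying it first on a scratch copy of the directory is the safer route.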
Going through the FFI to FFTW, a 1,048,576-sample buffer takes 0.202 seconds under SBCL 1.0.30 versus 0.72 seconds for my all-Lisp version. Under Clozure-64, it takes 0.91 seconds versus nearly 30 seconds for the all-Lisp version. I should check to make sure FFTW performs as well when my buffer is randomly filled instead of filled with a constant value. But, yep. There you have it.
So, certainly, if you need high-performance FFTs, you’ll want to FFI wrap something rather than use my all-Lisp version. If you don’t need such high performance, I give you no-hassle portability.
CCL 30 secs?
For version 1.2 CCL vs SBCL was 9.77 secs vs 2.72 secs. Now it’s 0.72 vs 30?
That CCL number is 512×512 (2D).
That SBCL number is 512×512×16 (3D).
These numbers are v1.3 on a 1M (1D).
Apples, oranges, and bananas. v1.3 SBCL is 0.13 on a 512×512 while CCL is 2.39.
Is it because your optimizations were more SBCL-specific, or is CCL just that slow?
CCL doesn’t notice when I (declaim (ftype (function … …) foo)). So, I’ve had similar problems finding good ways to optimize CCL code in the past. I really haven’t put any effort yet into optimizing this better for CCL. Under CCL, this code conses a ton where it doesn’t cons at all under SBCL. So, certainly, there is work to be done to get the declarations happy with CCL.
The timings will probably be even more interesting for x86-64 builds, at least in the case of SBCL (definitely, for Bordeaux FFT) and I would expect for FFTW. I don’t know how CCL handles FP, but I wouldn’t be surprised if you also saw a speed-up there.
Are there x86-64 builds of SBCL for Mac? My Linux box is very slow.
Yes. Unfortunately, the version currently available on the website lacks some floating-point improvements that were merged in for 1.0.30. You’ll have to build it yourself from the latest source (a 5-10 minute point-and-shoot process, except on 10.6 where it seems to be point-and-pray for some).