XML Parser Generator March 16th, 2010
Patrick Stein

A few years back (for a very generous few), we needed to parse a wide variety of XML strings. It was quite tedious to go from the XML to the native-language representations of the data (even from a DOM version). Furthermore, we needed to parse this XML both in Java and in C++.

I wrote (in Java) an XML parser generator that took an XML description of how you’d like the native-language data structures to look and where in the XML it could find the values for those data structures. The Java code-base for this was ugly, ugly, ugly. I tried several times to clean it up into something publishable. I tried to clean it up several times so that it could actually generate the parser it used to read the XML description file. Alas, the meta-ness, combined with the clunky Java code, kept me from completing the circle.

Fast forward to last week. Suddenly, I have a reason to parse a wide variety of XML strings in Objective C. I certainly didn’t want to pull out the Java parser generator and try to beat it into generating Objective C, too. That’s fortunate, too, because I cannot find any of the copies (in various states of repair) that once lurked in ~/src.

What’s a man to do? Write it in Lisp, of course.

Example

Here’s an example to show how it works. Let’s take some simple XML that lists food items on a menu:

<menu>
        <food name="Belgian Waffles" price="$5.95" calories="650">
                <description>two of our famous Belgian Waffles with plenty of real maple syrup</description>
        </food>
        <!-- ... more food entries, omitted here for brevity ... -->
</menu>

We craft an XML description of how to go from the XML into a native representation:

<parser_generator root="menu" from="/menu">
  <struct name="food item">
    <field type="string" name="name" from="@name" />
    <field type="string" name="price" from="@price" />
    <field type="string" name="description" from="/description/." />
    <field type="integer" name="calories" from="@calories" />
  </struct>

  <struct name="menu">
    <field name="menu items">
      <array>
        <array_element type="food item" from="/food" />
      </array>
    </field>
  </struct>
</parser_generator>

Now, you run the parser generator on the above input file:

% sh parser-generator.sh --language=lisp \
                           --types-package menu \
                           --reader-package menu-reader \
                           --file menu.xml

This generates two files for you: types.lisp and reader.lisp. This is what types.lisp looks like:

(defpackage :menu
  (:use :common-lisp)
  (:export #:food-item
             #:name
             #:price
             #:description
             #:calories
           #:menu
             #:menu-items))

(in-package :menu)

(defclass food-item ()
  ((name :initarg :name :type string)
   (price :initarg :price :type string)
   (description :initarg :description :type string)
   (calories :initarg :calories :type integer)))

(defclass menu ()
  ((menu-items :initarg :menu-items :type list :initform nil)))

I will not bore you with all of reader.lisp as it’s 134 lines of code you never had to write. The only part you need to worry about is the parse function which takes a stream for or pathname to the XML and returns an instance of the menu class. Here is a small snippet though:

;;; =================================================================
;;; food-item struct
;;; =================================================================
(defmethod data progn ((handler sax-handler) (item food-item) path value)
  (with-slots (name price description calories) item
    (case path
      (:|@name| (setf name value))
      (:|@price| (setf price value))
      (:|/description/.| (setf description value))
      (:|@calories| (setf calories (parse-integer value))))))

Where it’s at

I currently have the parser generator generating its own parser (five times fast). I still have a little bit more that I’d like to add to include assertions for things like the minimum number of elements in an array or the minimum value of an integer. I also have a few kinks to work out so that you can return some type other than an instance of a class for cases like this where the menu class just wraps one item.

My next step though is to get it generating Objective C parsers.

Somewhere in there, I’ll post this to a public git repository.

Casting to Integers Considered Harmful August 6th, 2009
Patrick Stein

Background

Many years back, I wrote some ambient music generation code. The basic structure of the code is this: Take one queen and twenty or so drones in a thirty-two dimensional space. Give them each random positions and velocities. Limit the velocity and acceleration of the queen more than you limit the same for the drones. Now, select some point at random for the queen to target. Have the queen accelerate toward that target. Have the drones accelerate toward the queen. Use the average distance from the drones to the queens in the i-th dimension as the volume of the i-th note where the notes are logarithmically spaced across one octave. Clip negative volumes to zero. Every so often, or when the queen gets close to the target, give the queen a new target.

It makes for some interesting ambient noise that sounds a bit like movie space noises where the lumbering enemy battleship is looming in orbit as its center portion spins to create artificial gravity within.

I started working on an iPhone application based on this code. The original code was in C++. The conversion to Objective C was fairly straightforward and fairly painless (as I used the opportunity to try to correct my own faults by breaking things out into separate functions more often).

Visualization troubles

uniform
The original code though chose random positions and velocities from uniform distributions. The iPhone app is going to involve visualization as well as auralization. The picture at the right here is a plot of five thousand points with each coordinate selected from a uniform distribution with range [-20,+20]. Because each axis value is chosen independently, it looks very unnatural.

gauss
What to do? The obvious answer is to use Gaussian random variables instead of uniform ones. The picture at the right here is five thousand points with each coordinate selected from a Gaussian distribution with a standard-deviation of 10. As you can see, this is much more natural looking.

How did I generate the Gaussians?

I have usually used the Box-Muller method of generating two Gaussian-distributed random variables given two uniformly-distributed random variables:

(defun random-gaussian ()
  (let ((u1 (random 1.0))
        (u2 (random 1.0)))
    (let ((mag (sqrt (* -2.0 (log u1))))
          (ang (* 2.0 pi u2)))
      (values (* mag (cos ang))
              (* mag (sin ang))))))

But, I found an article online that shows a more numerically stable version:

(defun random-gaussian ()
  (flet ((pick-in-circle ()
           (loop as u1 = (random 1.0)
                as u2 = (random 1.0)
                as mag-squared = (+ (* u1 u1) (* u2 u2))
                when (< mag-squared 1.0)
                return (values u1 u2 mag-squared))))
    (multiple-value-bind (u1 u2 mag-squared) (pick-in-circle)
      (let ((ww (sqrt (/ (* -2.0 (log mag-squared)) mag-squared))))
        (values (* u1 ww)
                (* u2 ww))))))

For a quick sanity check, I thought, let’s just make sure it looks like a Gaussian. Here, I showed the code in Lisp, but the original code was in Objective-C. I figured, If I just change the function declaration, I can plop this into a short C program, run a few thousand trials into some histogram buckets, and see what I get.

The trouble with zero

So, here comes the problem with zero. I had the following main loop:

#define BUCKET_COUNT 33
#define STDDEV       8.0
#define ITERATIONS   100000

  for ( ii=0; ii < ITERATIONS; ++ii ) {
    int bb = val_to_bucket( STDDEV * gaussian() );
    if ( 0 <= bb && bb < BUCKET_COUNT ) {
      ++buckets[ bb ];
    }
  }

I now present you with three different implementations of the val_to_bucket() function.

int val_to_bucket( double _val ) {
  return (int)_val + ( BUCKET_COUNT / 2 );
}

int val_to_bucket( double _val ) {
  return (int)( _val + (int)( BUCKET_COUNT / 2 ) );
}

int val_to_bucket( double _val ) {
  return (int)( _val + (int)( BUCKET_COUNT / 2 ) + 1 ) - 1;
}

As you can probably guess, after years or reading trick questions, only the last one actually works as far as my main loop is concerned. Why? Every number between -1 and +1 becomes zero when you cast the double to an integer. That’s twice as big a range as any other integer gets. So, for the first implementation, the middle bucket has about twice as many things in it as it should. For the second implementation, the first bucket has more things in it than it should. For the final implementation, the non-existent bucket before the first one is the overloaded bucket. In the end, I used this implementation instead so that I wouldn’t even bias non-existent buckets:

int val_to_bucket( double _val ) {
  return (int)lround(_val) + ( BUCKET_COUNT / 2 );
}

Why Not Return A Function? July 2nd, 2009
Patrick Stein

This morning, I was catching up on the RSS feeds I follow. I noticed an interesting snippet of code in the Abstract Heresies journal for the Discrete-time Fourier Transform. Here is his snippet of code:

(define (dtft samples)
  (lambda (omega)
    (sum 0 (vector-length samples)
        (lambda (n)
          (* (vector-ref samples n)
             (make-polar 1.0 (* omega n)))))))

My first thought was That’s way too short. Then, I started reading through it. My next thought was, maybe I don’t understand scheme at all. Then, my next thought was, I do understand this code, it just didn’t do things the way I expected.

Here, I believe is a decent translation of the above into Common Lisp:

(defun dtft (samples)
  #'(lambda (omega)
      (loop for nn from 0 below (length samples)
           summing (let ((angle (* omega nn)))
                     (* (aref samples nn)
                        (complex (cos angle) (sin angle)))))))

Now, what I find most interesting here is that most implementations you’ll find for the DTFT (Discrete-Time Fourier Transform) take an array of samples and a wavelength, and returns a result. This, instead, returns a function which you can call with a wavelength omega. It returns an evaluator. This is an interesting pattern that I will have to try to keep in mind. I have used it in some places before. But, I am sure there are other places that I should have used it, and missed. This is one where I would have missed.

Usage for the above would go something like:

(let ((dd (dtft samples)))
  (dd angle1)
  (dd angle2)
  ...)

For those not into Lisp either (who are you?), here is a rough translation into C++.

#include <cmath>
#include <complex>

class DTFT {
  unsigned int len;
  double* samples;

public:
  DTFT( double* _samples, unsigned int _len ) {
    this->len = _len;
    this->samples = new double[ this->len ];
    for ( unsigned int ii=0; ii < this->len; ++ii ) {
      this->samples[ ii ] = _samples[ ii ];
    }
  }

  ~DTFT( void ) {
    delete[] this->samples;
  }

  std::complex< double > operator () ( double omega ) const {
    std::complex< double > sum( 0.0, 0.0 );
    for ( unsigned int ii=0; ii < this->len; ++ii ) {
      sum += this->samples[ ii ] * std::polar( 1.0, omega );
    }
    return sum;
  }
};

With usage like:

DTFT dd( samples, 1024 );
dd( angle1 );
dd( angle2 );
...

So, six lines of Scheme or Lisp. Twenty-five lines of C++ including explicit definition of a class to act as a pseudo-closure, explicit copying and management of the samples buffer, etc. I suppose, a more direct translation would have used a std::vector to hold the samples and would have just kept a pointer to the buffer. That would have shaved off six or seven lines and the whole len and _len variables.

To CLOS or not to CLOS June 29th, 2009
Patrick Stein

I am working on some Lisp code. I am trying to mimic the basic structure of a large C++ project. I think the way the C++ project is structured is a good fit for the tasks involved.

Most of the C++ stuff is done with classes. Most of the methods of those classes are virtual methods. Many of the methods will be called multiple times every hundredth of a second. Very few of the virtual methods ever get overridden by subclasses.

So, I got to thinking. Does it make sense for me to use CLOS stuff at all for most of this? Would it be significantly faster to use (defstruct …) and (defun …) instead of (defclass …) and (defgeneric …)?

My gut instinct was: Yes. My first set of test code didn’t bear that out. For both the classes and the structs, I used (MAKE-INSTANCE …) to allocate and (WITH-SLOTS …) to access. When doing so, the classes with generic functions took about 1.5 seconds for 10,000,000 iterations under SBCL while the structs with non-generic functions took about 2.2 seconds.

For the next iteration of testing, I decided to use accessors instead of slots. I had the following defined:

(defclass base-class ()
  ((slot-one :initform (random 1.0f0) :type single-float
             :accessor class-slot-one)
   (slot-two :initform (random 1.0f0) :type single-float
             :accessor class-slot-two)))

(defclass sub-class (base-class)
  ((slot-three :initform (random 1.0f0) :type single-float
               :accessor class-slot-three)
   (slot-four  :initform (random 1.0f0) :type single-float
               :accessor class-slot-four)))

(defstruct (base-struct)
  (slot-one (random 1.0f0) :type single-float)
  (slot-two (random 1.0f0) :type single-float))

(defstruct (sub-struct (:include base-struct))
  (slot-three (random 1.0f0) :type single-float)
  (slot-four (random 1.0f0) :type single-float))

I then switched from using calls like (slot-value instance ’slot-three) in my functions to using calls like (class-slot-three instance) or (sub-struct-slot-three instance). Now, 10,000,000 iterations took 2.6 seconds for the classes and 0.3 seconds for the structs in SBCL. In Clozure (64-bit), 10,000,000 iterations with classes and with method-dispatch and with accessors took 11.0 seconds, with classess and without method-dispatch but with accessors took 6.1 seconds, and with structs and without method-dispatch and with accessors took 0.4 seconds.

There is much more that I can explore to tune this code. But, for now, I think the answer is fairly easy. I am going to use structs, functions, and accessors for as many cases as I can. Here is the generic-function, class, struct test code with accessors. The first loop uses classes and generic functions. The second loop uses classes and regular functions. The third loop uses structs and regular functions.

How I Know I Don’t Know Enough About Lisp Macros June 19th, 2009
Patrick Stein

I have a pretty good grasp on macros in Lisp, C/C++, TeX, m4, and nroff. Lisp macros are, far and away, the most powerful in that set. In fact, Lisp macros get a bad rap by sharing a name with macros in those other languages. Still, I often hit the same wall when trying to write a Lisp macro so I know that there’s more I have to learn.

A Quick Look at Macros

First, let’s review macros for a moment to make sure we’re all on the same page. The C/C++ macros are the simplest to understand so we’ll start there. The C and C++ compilers run in two stages. First, the C preprocessor runs to expand all of your macros. Then, the compiler runs to actually compile your code. In C and C++, macros are declared using the #define command to the preprocessor. For example, you might do something like the following:
Read the rest of this entry ⇒

Updates In Email

Email:

l