XML Parser Generator March 16th, 2010
Patrick Stein

A few years back (for a very generous few), we needed to parse a wide variety of XML strings. It was quite tedious to go from the XML to the native-language representations of the data (even from a DOM version). Furthermore, we needed to parse this XML both in Java and in C++.

I wrote (in Java) an XML parser generator that took an XML description of how you’d like the native-language data structures to look and where in the XML it could find the values for those data structures. The Java code-base for this was ugly, ugly, ugly. I tried several times to clean it up into something publishable, and several more times to get it to generate the parser it used to read its own XML description file. Alas, the meta-ness, combined with the clunky Java code, kept me from completing the circle.

Fast forward to last week. Suddenly, I have a reason to parse a wide variety of XML strings in Objective-C. I certainly didn’t want to pull out the Java parser generator and try to beat it into generating Objective-C, too. That’s just as well, because I cannot find any of the copies (in various states of repair) that once lurked in ~/src.

What’s a man to do? Write it in Lisp, of course.


Here’s an example to show how it works. Let’s take some simple XML that lists food items on a menu:

<menu>
        <food name="Belgian Waffles" price="$5.95" calories="650">
                <description>two of our famous Belgian Waffles with plenty of real maple syrup</description>
        </food>
        <!-- ... more food entries, omitted here for brevity ... -->
</menu>

We craft an XML description of how to go from the XML into a native representation:

<parser_generator root="menu" from="/menu">
  <struct name="food item">
    <field type="string" name="name" from="@name" />
    <field type="string" name="price" from="@price" />
    <field type="string" name="description" from="/description/." />
    <field type="integer" name="calories" from="@calories" />
  </struct>

  <struct name="menu">
    <field name="menu items">
        <array_element type="food item" from="/food" />
    </field>
  </struct>
</parser_generator>

Now, you run the parser generator on the above input file:

% sh parser-generator.sh --language=lisp \
                           --types-package menu \
                           --reader-package menu-reader \
                           --file menu.xml

This generates two files for you: types.lisp and reader.lisp. This is what types.lisp looks like:

(defpackage :menu
  (:use :common-lisp)
  (:export #:food-item
           #:menu))

(in-package :menu)

(defclass food-item ()
  ((name :initarg :name :type string)
   (price :initarg :price :type string)
   (description :initarg :description :type string)
   (calories :initarg :calories :type integer)))

(defclass menu ()
  ((menu-items :initarg :menu-items :type list :initform nil)))
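
To make the mapping from description to class concrete, here is a minimal sketch of how a struct with scalar fields could be turned into a defclass form like the one above. It is not the generator's actual code: the names lisp-name, field-slot and struct-form are made up for illustration, the fields are passed in as plain (name type) string pairs rather than read from the XML, and array fields such as menu items are left out.

```lisp
;; Hypothetical helpers; none of these names come from the real generator.

(defun lisp-name (name)
  "Map a description name such as \"food item\" to the symbol FOOD-ITEM."
  (intern (substitute #\- #\Space (string-upcase name))))

(defun field-slot (field)
  "Build a slot specifier from a (NAME TYPE) pair of strings."
  (destructuring-bind (name type) field
    (let ((slot (lisp-name name)))
      (list slot
            :initarg (intern (symbol-name slot) :keyword)
            :type (lisp-name type)))))

(defun struct-form (name fields)
  "Return a DEFCLASS form for the struct NAME with scalar FIELDS."
  `(defclass ,(lisp-name name) ()
     ,(mapcar #'field-slot fields)))

;; The food item struct from the description, written out by hand.
(defparameter *food-item-form*
  (struct-form "food item"
               '(("name" "string")
                 ("price" "string")
                 ("description" "string")
                 ("calories" "integer"))))
```

Printing *food-item-form* at the REPL gives a form with the same shape as the food-item class in types.lisp. Because the result is an ordinary list, writing it to a file is just a matter of printing it.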

I will not bore you with all of reader.lisp as it’s 134 lines of code you never had to write. The only part you need to worry about is the parse function, which takes either a stream or a pathname for the XML and returns an instance of the menu class. Here is a small snippet though:

;;; =================================================================
;;; food-item struct
;;; =================================================================
(defmethod data progn ((handler sax-handler) (item food-item) path value)
  (with-slots (name price description calories) item
    (case path
      (:|@name| (setf name value))
      (:|@price| (setf price value))
      (:|/description/.| (setf description value))
      (:|@calories| (setf calories (parse-integer value))))))
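
The progn qualifier on that method only works if the data generic function uses progn method combination, which calls every applicable method, most specific first. The sketch below shows that mechanism in isolation, under some assumptions: the defgeneric is my guess at the declaration, sax-handler is an empty stand-in class rather than the real handler, daily-special and its @day attribute are invented to show a subclass adding a path, and fill-item replaces the SAX events with a hand-written list of path and value pairs.

```lisp
;; Hypothetical stand-in for the SAX handler class used by reader.lisp.
(defclass sax-handler () ())

;; Same slots as the generated food-item class in types.lisp.
(defclass food-item ()
  ((name :initarg :name :type string)
   (price :initarg :price :type string)
   (description :initarg :description :type string)
   (calories :initarg :calories :type integer)))

;; Hypothetical subclass, not part of the menu example.
(defclass daily-special (food-item)
  ((day :initarg :day :type string)))

;; Assumed declaration: with PROGN combination every applicable method runs,
;; so a subclass can add paths without overriding the ones its parent handles.
(defgeneric data (handler item path value)
  (:method-combination progn))

(defmethod data progn ((handler sax-handler) (item food-item) path value)
  (with-slots (name price description calories) item
    (case path
      (:|@name| (setf name value))
      (:|@price| (setf price value))
      (:|/description/.| (setf description value))
      (:|@calories| (setf calories (parse-integer value))))))

(defmethod data progn ((handler sax-handler) (item daily-special) path value)
  (with-slots (day) item
    (case path
      (:|@day| (setf day value)))))

(defun fill-item (item pairs)
  "Feed each (PATH . VALUE) pair in PAIRS to DATA and return ITEM."
  (let ((handler (make-instance 'sax-handler)))
    (loop for (path . value) in pairs
          do (data handler item path value)))
  item)

(defparameter *special*
  (fill-item (make-instance 'daily-special)
             '((:|@name| . "Belgian Waffles")
               (:|@price| . "$5.95")
               (:|@calories| . "650")
               (:|@day| . "Friday"))))
```

Each pair is offered to both methods. The daily-special method handles @day and ignores the rest, and the food-item method handles the other three, so *special* ends up with name, price, calories (as an integer) and day all set.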

Where it’s at

I currently have the parser generator generating its own parser (five times fast). I still have a little more to add: assertions for things like the minimum number of elements in an array or the minimum value of an integer. I also have a few kinks to work out so that you can return some type other than an instance of a class, for cases like this one where the menu class just wraps one item.

My next step though is to get it generating Objective-C parsers.

Somewhere in there, I’ll post this to a public git repository.
