Embedding a mini-language for XML construction into Perl

In this meditation, we will embed a mini-language for building XML documents into Perl. Our goal is to see how much syntax we can remove in pursuit of what Damian Conway calls "sufficiently advanced technologies." We want to make building XML just like writing native Perl:

# html {
#     head { title { text "Title" } };
#     body {
#         p { class_ "warning"; text "paragraph" }
#     };
# }

There is nothing particularly novel about this approach, and there are similar libraries for many programming languages. Our implementation, however, will stress Perlishness and simplicity. To eliminate clutter during the meditation, we will not make a module but instead expose the underlying code.

Here is our game plan. We will represent XML documents as trees of nested arrays and then render the trees as XML. A node in our tree will be either text (represented as a string) or an element (represented as a triple of the form [name, attributes, children_nodes]). Attributes will be pairs of the form [name, value]. (We will ignore namespaces, XML declarations, and other aspects of XML generation that don't add much to the meditation.)
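To make the representation concrete, here is a hand-built tree for a single paragraph element. It is only a sketch of the data layout described above; the variable name $para is ours, chosen for illustration.

```perl
use strict;
use warnings;

# An element is the triple [name, attributes, children].
# Attributes are [name, value] pairs; text children are plain strings.
my $para = [
    'p',
    [ [ 'class', 'warning' ] ],
    [ 'paragraph' ],
];

# Unpack the triple and report what it holds
my ($name, $attrs, $children) = @$para;
print "$name has ", scalar @$attrs, " attribute(s) and ",
      scalar @$children, " child node(s)\n";
```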

To build a document, we will call functions that append elements, attributes, and text to the active node in the tree, redefining the active node in passing:

To add an element called name, we will call the function name and pass it a function that will construct its attributes and children.
To add an attribute called name, we will call the function name_ and pass it the attribute's value (mnemonic: the _ stands for the equals sign in name="val").
To add text, we will call the function text and pass it the text.
Finally, we will create one additional helper called doc that will create an empty document; we will use it to "root" documents created by the earlier functions.

To make it all seem more natural, we will use the (&) prototype on element-creating functions and doc. This lets us use braces to represent nesting when calling the functions:

# doc {
#     my_elem {
#         # children go here
#     };
# };
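As a small illustration of what the (&) prototype buys us, the sketch below calls the same function twice, once with bare braces and once with an explicit anonymous sub. The helper name run_block is hypothetical and is not part of the meditation's code.

```perl
use strict;
use warnings;

# Hypothetical helper with the (&) prototype: it takes one code block
sub run_block(&) {
    my ($fn) = @_;
    return $fn->();
}

# Thanks to the prototype, these two calls are equivalent
my $with_braces = run_block { 6 * 7 };
my $with_sub    = run_block( sub { 6 * 7 } );

print "$with_braces $with_sub\n";   # prints "42 42"
```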

Likewise, the attribute-creating functions and text get the ($) prototype. This lets us call them without having to use parentheses:

# my_elem {
#     text "some text";
#     my_attr_ "value";
# };

With the game plan in mind, let's work top down:

our $__frag;  # points to fragment under active construction

sub doc(&) {
    my ($content_fn) = @_;
    local $__frag = [undef,undef,undef];
    $content_fn->();
    $__frag->[2][0];
}

sub _elem {
    my ($elem_name, $content_fn) = @_;
    # an element is represented by the triple [name, attrs, children]
    my $elem = [$elem_name, undef, undef];
    do { local $__frag = $elem; $content_fn->() };
    push @{$__frag->[2]}, $elem;
}

sub _attr {
    my ($attr_name, $val) = @_;
    push @{$__frag->[1]}, [$attr_name, $val];
}

sub text($) {
    push @{$__frag->[2]}, @_;
}
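The functions above lean on local to switch the active node for the duration of a call and to restore it afterward. For readers who have not used local much, here is a minimal sketch of that behavior in isolation; the variable $active and both subs are hypothetical stand-ins, not part of the code above.

```perl
use strict;
use warnings;

our $active = 'outer';      # hypothetical stand-in for $__frag

sub current { return $active }

sub with_inner {
    # local saves the old value and restores it when this sub returns
    local $active = 'inner';
    return current();       # current() sees 'inner' while we are in here
}

my $during = with_inner();
my $after  = current();
print "during: $during, after: $after\n";   # during: inner, after: outer
```

This is the same pattern _elem uses: the new element becomes the active node only while its content function runs, and the parent becomes active again once it returns.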

The functions _elem and _attr are helpers used by the following function, which lets us embed a custom XML vocabulary into Perl by creating the appropriate Perl functions for the vocabulary's elements and attributes:

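sub define_vocabulary {
    my ($elems, $attrs) = @_;
    # \@_ and \$ are escaped so that @_ and the ($) prototype reach the
    # generated code intact; the trailing 1 makes a successful eval true
    eval "sub $_(&) { _elem('$_',\@_) } 1"     or die $@ for @$elems;
    eval "sub ${_}_(\$) { _attr('$_',\@_) } 1" or die $@ for @$attrs;
}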
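String eval is not the only way to create these functions. As one possible alternative, the sketch below installs prototyped closures through typeglobs. It handles elements only, and the names define_elements_via_globs, note, and para are hypothetical; _elem is repeated from above so that the example stands alone.

```perl
use strict;
use warnings;

our $__frag;   # active node, as in the meditation

sub _elem {
    my ($elem_name, $content_fn) = @_;
    my $elem = [$elem_name, undef, undef];
    do { local $__frag = $elem; $content_fn->() };
    push @{ $__frag->[2] }, $elem;
}

# Hypothetical variant of define_vocabulary (elements only): instead of
# evaluating strings, assign prototyped closures to the symbol table
sub define_elements_via_globs {
    my @names = @_;
    no strict 'refs';   # needed for the symbolic glob reference below
    for my $name (@names) {
        *{"main::$name"} = sub (&) { _elem($name, @_) };
    }
}

# As before, BEGIN makes the prototypes visible to the code that follows
BEGIN { define_elements_via_globs(qw( note para )) }

local $__frag = [undef, undef, undef];
note { para { } };

print "root child: $__frag->[2][0][0]\n";
print "its child:  $__frag->[2][0][2][0][0]\n";
```

The trade-off is that nothing is interpolated into source text, at the cost of a no strict 'refs' block.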

We can use the above function, for example, to embed a subset of XHTML into Perl:

BEGIN {
    define_vocabulary(
        [qw( html head title body h1 h2 h3 p img br )],
        [qw( src href class style )]
    );
}

(The use of BEGIN ensures that the embedded functions' prototypes are established before any remaining code is compiled.)

Let's try out our newly embedded vocabulary by dumping out the internal representation of a simple document:

my $my_doc = doc {
    html {
        head { title { text "Title" } };
        body {
            p { class_ "warning"; text "paragraph" }
        };
    }
};

use Data::Dumper;
$Data::Dumper::Indent = $Data::Dumper::Terse = 1;
print Dumper $my_doc;
# [
#   'html',
#   undef,
#   [
#     [
#       'head',
#       undef,
#       [
#         [
#           'title',
#           undef,
#           [
#             'Title'
#           ]
#         ]
#       ]
#     ],
#     [
#       'body',
#       undef,
#       [
#         [
#           'p',
#           [
#             [
#               'class',
#               'warning'
#             ]
#           ],
#           [
#             'paragraph'
#           ]
#         ]
#       ]
#     ]
#   ]
# ]

Good! That's just what we want.

All that is left for us to do is render the internal representation as XML. The simplicity of our internal representation makes this straightforward. Here's a renderer that uses XML::Writer:

use XML::Writer;

sub render_via_xml_writer {
    my $doc = shift;
    my $writer = XML::Writer->new(@_);  # extra args go to ->new()
    my $render_fn;
    $render_fn = sub {
        my $frag = shift;
        my ($elem, $attrs, $children) = @$frag;
        # flatten the [name, value] pairs into the list startTag expects
        $writer->startTag( $elem, map {@$_} @{ $attrs || [] } );
        for (@{ $children || [] }) {
            # references are child elements; plain strings are text
            ref($_) ? $render_fn->($_) : $writer->characters($_);
        }
        $writer->endTag($elem);
    };
    $render_fn->($doc);
    $writer->end();
}
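Because the tree is plain nested arrays, other renderers are easy to write. As a rough sketch of that idea, here is a renderer that returns a string and needs no modules. The names _xml_escape and render_as_string are hypothetical, and the escaping covers only the basic special characters, so treat it as an illustration and not as a replacement for XML::Writer.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in XML
sub _xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;     # must come first so later entities survive
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Hypothetical renderer: walk the [name, attrs, children] tree recursively
sub render_as_string {
    my ($node) = @_;
    return _xml_escape($node) unless ref $node;   # text node

    my ($name, $attrs, $children) = @$node;
    my $attr_str = join '',
        map { sprintf ' %s="%s"', $_->[0], _xml_escape($_->[1]) }
        @{ $attrs || [] };
    my $body = join '', map { render_as_string($_) } @{ $children || [] };
    return "<$name$attr_str>$body</$name>";
}

my $tree = [ 'p', [ [ 'class', 'warning' ] ], [ 'a < b' ] ];
print render_as_string($tree), "\n";
# prints: <p class="warning">a &lt; b</p>
```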

Now we can render our earlier document:

render_via_xml_writer( $my_doc, DATA_MODE => 1, UNSAFE => 1 );
# <html>
# <head>
# <title>Title</title>
# </head>
# <body>
# <p class="warning">paragraph</p>
# </body>
# </html>

In most cases we will render documents shortly after creating them. We can "huffmanize" this common case with another helper, which supplies the outer doc for us and then renders the resulting tree:

sub render_doc(&) {
    my $docfn = shift;
    render_via_xml_writer(
        doc( \&$docfn ),
        DATA_MODE => 1,
        UNSAFE => 1
    );
}

Our final example shows the fruits of our labors. We have successfully embedded a custom subset of XHTML into Perl. Now we can use it to create XML fragments with very little syntactic overhead. Further, because our embedding is "just Perl," we can freely mix code and fragments to do the work of template engines:

render_doc {
    html {
        head { 
            title { text "My grand document!" }
        };
        body {
            h1 { text "Heading" };
            p {
                class_ "first";       # attribute class="first"
                text "This is the first paragraph!";
                style_ "font: bold";  # another attr
            };
            # it's just Perl, so we can mix in other code
            for (2..5) {
                p { text "Plus paragraph number $_." }
            }
        };
    };
};
# <html>
# <head>
# <title>My grand document!</title>
# </head>
# <body>
# <h1>Heading</h1>
# <p class="first" style="font: bold">This is the first paragraph!</p>
# <p>Plus paragraph number 2.</p>
# <p>Plus paragraph number 3.</p>
# <p>Plus paragraph number 4.</p>
# <p>Plus paragraph number 5.</p>
# </body>
# </html>
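Since fragments are built by ordinary function calls, we can also factor a recurring fragment into a subroutine of its own. The sketch below shows one way that might look. The helper bullet_list and the ul/li vocabulary are hypothetical additions, and the core functions are repeated in condensed form so that the example stands alone.

```perl
use strict;
use warnings;

our $__frag;   # active node, as in the meditation

sub doc(&) {
    my ($content_fn) = @_;
    local $__frag = [undef, undef, undef];
    $content_fn->();
    $__frag->[2][0];
}

sub _elem {
    my ($elem_name, $content_fn) = @_;
    my $elem = [$elem_name, undef, undef];
    do { local $__frag = $elem; $content_fn->() };
    push @{ $__frag->[2] }, $elem;
}

sub text($) { push @{ $__frag->[2] }, @_ }

# A two-element vocabulary, defined the same way define_vocabulary does it
BEGIN {
    for my $n (qw( ul li )) {
        eval "sub $n(&) { _elem('$n', \@_) } 1" or die $@;
    }
}

# Hypothetical reusable fragment: a list with one item per argument
sub bullet_list {
    my @items = @_;
    ul {
        for my $item (@items) {
            li { text $item };
        }
    };
}

my $tree = doc { bullet_list('one', 'two') };
print "$tree->[0] with ", scalar @{ $tree->[2] }, " items\n";
```

Calling bullet_list inside any element's braces appends the list to that element, because the helper simply runs against whatever node is active at the time.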

Thanks for taking the time to read this meditation! If you find anything about it unclear, or can think of a way to improve my writing, please let me know.

Cheers
Tom

Tom Moertel : Blog / Talks / CPAN / LectroTest / PXSL / Coffee / Movie Rating Decoder
