The case for an anti-lexer - Re: Parsing an XML-like definition of an XML-like language to create a parser of the actual data in that language.

A note to say thanks for all the input. Planetscape´s was the most inspiring because as it turned out XML::Simple wouldn't compile under version 5.005_003 so, after trying to 'fix' it but finding the task too lengthy I soon found myself seeking approval to spend the 8 hours budget writing a parser from scratch. Fortunately, it was given and the result, one working day at this client later is shown below.

A couple of notes first:

1) this is only tested against the particular XML it needed to parse.

2) I am about to modify it so that it has SAX-like support except that I need to hand-roll that as well for the same reasons. The unmodified version can be regarded as the subroutine GetTag shown below because the actual data is one huge tag. Apart from this being non-generic, I will want to process at one level lower than that and SAX-style looks the best kind of algorithm.

3) Owing to the tag - body - terminator structure, I found it easier to write an antilexer than a normal one, i.e. it returns everything up to a choice of matched terminating patterns instead of matching the immediate content. This appears to have a number of advantages over a positive lexer:

- no need for a token table

- language independent

- trivialises the lexer and thrower routines even more than usual with Perl

An extra mechanism is a Step routine to step over the terminating expression which being logically known didn't usually need to be lexed specifically. This is no downside in terms of overall code length, however.

Note also that this version takes a file handle as argument, but if that is omitted, it will parse $_, which it does anyway at lower levels.

#!/usr/bin/perl -w
#
# @(#) ReportValidationAuditTrail.sh 1.0
#
# Author: (Moron)
#
# Versie 1.0  16 oktober 2006

use strict;
use locale;
use Time::localtime;
use lib $ENV{AC_PERLLIB};
use Env;
use Utilities;

use IPC::Open3; 
use POSIX ":sys_wait_h"; 

# ... main program logic omitted as being off-topic
# ...

sub GetTag {
    my $fh = shift;
    my $pastNoise = 0;
    my ( $tag, $sts, $cnt, $twixt );
    do { # walk past comment tags e.g. <?version ... >
        Throw( $fh ); # walk past whitespace and "\n"s
        /^\</ or XMLerror( 'Format' );
        Step(); # step over one char 
        ( $tag, $sts ) = AntiLex( $fh, '\W' );  # collect data until \
+W 
                            # and then walk there
        unless( $pastNoise = $tag ) {
            ( $cnt, $sts ) = AntiLex( $fh, '\>' );
            /^\>/ or XMLerror( 'Comment Unclosed By > ' );
            Step();
        }
    } until ( $pastNoise );
    Throw( $fh);
    my $assignments = {};
    ASSMNT: for ( my $assco = 0; !/^\>/; $assco++ ) {
        my $kwd;
        ( $kwd, $sts ) = AntiLex( $fh, '\W', );
        unless ( $kwd )
        {
            # only valid way is no assignments
            ( $assco || !/^\>/ ) and XMLerror( 'Format' );
            last ASSMNT;
        }
        Throw( $fh );
        ( $cnt, $sts ) = AntiLex( $fh, '\=' );
        ( $cnt || !$sts ) and XMLerror( 'Format' );
        Step();
        Throw( $fh );
        my $val = '';

        # error-check for something before quotes
        ( $twixt, $sts ) = AntiLex( $fh, '\"' );
        Step();
        $twixt and XMLerror( 'Format' );
        do { # quotes loop
            ( $cnt, $sts ) = AntiLex( $fh, '\"', '\\\"' );
            $sts or XMLerror( 'Unclosed Quote' );
            $val .= $cnt;
            length() or $_ = <$fh>;
        } until ( /^\"/ ); # i.e. include \" as part of string
        Step();
        $assignments -> { $kwd } = $val;
        Throw( $fh );
        length() or XMLerror( 'Unexpected End Of XML' );
    }
    Step();

    Throw( $fh );

    # case of simple value for current tag ...
    my $simple = '';
    /^</ or ( $simple, $sts ) = AntiLex( $fh, '<' );

    my @subtags = (); # collect nested tags to current tag
    while ( !$simple && /^\<(.)/ && ($1 ne '/' ) ) {
        push @subtags, GetTag( $fh );
        Throw( $fh );
    }

    AntiLex( $fh, '<' ); # * ... see comment below

    if ( /^\<\/(\w+)\>(.*)/ ) {
        ( $1 eq $tag ) or XMLerror( 'Tag Nesting' );
        $_ = $2; # walk past closing tag.
        $simple and return { $tag => $simple };
        return { $tag => { ASSMNTS => $assignments,
                   SUBTAGS => \@subtags  } };
    }

    XMLerror( "Format" ); # anything okay between '*' and here was eli
+minated from suspicion.

} 

# subroutine to walk past whitespace.
sub Throw { return ( AntiLex( shift(), '\S' ) ); }

sub Step { # like chopping $_ but from the LEFT of the string
    s/^(.)//;
    return $1;
}

sub XMLerror {
    my $reason = shift;
    my @ct = split( "\n" );
    die "XML $reason Error: $ct[0]";
}

sub AntiLex {     # - walk thru $_, reloading from optional fh if pres
+ent, until
        #   matching one of a list of regexps
        # - eats the returned content from $_ ready for
        #   repeated calls to this routine by the calling parser
        #
        # to parse positively just give it negative regexps.
        # the purpose is to roll up a lexer and thrower into a trivial
        # piece of code.

    # - SYNOPSIS: ( $content, $status ) = AntiLex ( [fh], { pattern, .
+.. } )

    my $fh = shift; # undef means simply: don't reload emptied $_ from
+ file

    my $contents = '';
    while ( 1 ) {
        unless( defined() && length() ) {
            defined( $fh ) and $_ = <$fh>;
            $_ or return ( $contents, 0 );
            chomp;
        }
        for my $pat ( @_ ) {
            ( /^($pat)(.*)$/ ) and return ( $contents, 1 );
        }
        $contents .= Step();
    } 
}
[download]

-M

Free your mind

Comment on The case for an anti-lexer - Re: Parsing an XML-like definition of an XML-like language to create a parser of the actual data in that language. Download Code