A note to say thanks for all the input. Planetscape´s was the most inspiring because as it turned out XML::Simple wouldn't compile under version 5.005_003 so, after trying to 'fix' it but finding the task too lengthy I soon found myself seeking approval to spend the 8 hours budget writing a parser from scratch. Fortunately, it was given and the result, one working day at this client later is shown below.

A couple of notes first:

1) this is only tested against the particular XML it needed to parse.

2) I am about to modify it so that it has SAX-like support except that I need to hand-roll that as well for the same reasons. The unmodified version can be regarded as the subroutine GetTag shown below because the actual data is one huge tag. Apart from this being non-generic, I will want to process at one level lower than that and SAX-style looks the best kind of algorithm.

3) Owing to the tag - body - terminator structure, I found it easier to write an antilexer than a normal one, i.e. it returns everything up to a choice of matched terminating patterns instead of matching the immediate content. This appears to have a number of advantages over a positive lexer:

- no need for a token table

- language independent

- trivialises the lexer and thrower routines even more than usual with Perl

An extra mechanism is a Step routine to step over the terminating expression which being logically known didn't usually need to be lexed specifically. This is no downside in terms of overall code length, however.

Note also that this version takes a file handle as argument, but if that is omitted, it will parse $_, which it does anyway at lower levels.

#!/usr/bin/perl -w # # @(#) ReportValidationAuditTrail.sh 1.0 # # Author: (Moron) # # Versie 1.0 16 oktober 2006 use strict; use locale; use Time::localtime; use lib $ENV{AC_PERLLIB}; use Env; use Utilities; use IPC::Open3; use POSIX ":sys_wait_h"; # ... main program logic omitted as being off-topic # ... sub GetTag { my $fh = shift; my $pastNoise = 0; my ( $tag, $sts, $cnt, $twixt ); do { # walk past comment tags e.g. <?version ... > Throw( $fh ); # walk past whitespace and "\n"s /^\</ or XMLerror( 'Format' ); Step(); # step over one char ( $tag, $sts ) = AntiLex( $fh, '\W' ); # collect data until \ +W # and then walk there unless( $pastNoise = $tag ) { ( $cnt, $sts ) = AntiLex( $fh, '\>' ); /^\>/ or XMLerror( 'Comment Unclosed By > ' ); Step(); } } until ( $pastNoise ); Throw( $fh); my $assignments = {}; ASSMNT: for ( my $assco = 0; !/^\>/; $assco++ ) { my $kwd; ( $kwd, $sts ) = AntiLex( $fh, '\W', ); unless ( $kwd ) { # only valid way is no assignments ( $assco || !/^\>/ ) and XMLerror( 'Format' ); last ASSMNT; } Throw( $fh ); ( $cnt, $sts ) = AntiLex( $fh, '\=' ); ( $cnt || !$sts ) and XMLerror( 'Format' ); Step(); Throw( $fh ); my $val = ''; # error-check for something before quotes ( $twixt, $sts ) = AntiLex( $fh, '\"' ); Step(); $twixt and XMLerror( 'Format' ); do { # quotes loop ( $cnt, $sts ) = AntiLex( $fh, '\"', '\\\"' ); $sts or XMLerror( 'Unclosed Quote' ); $val .= $cnt; length() or $_ = <$fh>; } until ( /^\"/ ); # i.e. include \" as part of string Step(); $assignments -> { $kwd } = $val; Throw( $fh ); length() or XMLerror( 'Unexpected End Of XML' ); } Step(); Throw( $fh ); # case of simple value for current tag ... my $simple = ''; /^</ or ( $simple, $sts ) = AntiLex( $fh, '<' ); my @subtags = (); # collect nested tags to current tag while ( !$simple && /^\<(.)/ && ($1 ne '/' ) ) { push @subtags, GetTag( $fh ); Throw( $fh ); } AntiLex( $fh, '<' ); # * ... see comment below if ( /^\<\/(\w+)\>(.*)/ ) { ( $1 eq $tag ) or XMLerror( 'Tag Nesting' ); $_ = $2; # walk past closing tag. $simple and return { $tag => $simple }; return { $tag => { ASSMNTS => $assignments, SUBTAGS => \@subtags } }; } XMLerror( "Format" ); # anything okay between '*' and here was eli +minated from suspicion. } # subroutine to walk past whitespace. sub Throw { return ( AntiLex( shift(), '\S' ) ); } sub Step { # like chopping $_ but from the LEFT of the string s/^(.)//; return $1; } sub XMLerror { my $reason = shift; my @ct = split( "\n" ); die "XML $reason Error: $ct[0]"; } sub AntiLex { # - walk thru $_, reloading from optional fh if pres +ent, until # matching one of a list of regexps # - eats the returned content from $_ ready for # repeated calls to this routine by the calling parser # # to parse positively just give it negative regexps. # the purpose is to roll up a lexer and thrower into a trivial # piece of code. # - SYNOPSIS: ( $content, $status ) = AntiLex ( [fh], { pattern, . +.. } ) my $fh = shift; # undef means simply: don't reload emptied $_ from + file my $contents = ''; while ( 1 ) { unless( defined() && length() ) { defined( $fh ) and $_ = <$fh>; $_ or return ( $contents, 0 ); chomp; } for my $pat ( @_ ) { ( /^($pat)(.*)$/ ) and return ( $contents, 1 ); } $contents .= Step(); } }

-M

Free your mind


In reply to The case for an anti-lexer - Re: Parsing an XML-like definition of an XML-like language to create a parser of the actual data in that language. by Moron
in thread Parsing an XML-like definition of an XML-like language to create a parser of the actual data in that language. by Moron

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.