comment on

If you are using a windows system where you don't have admin privileges to install compiled code, that's a shame -- I really would expect the sysadmins to recognize the value of making the expat package available to the software development staff, seeing as it's sort of AN INDISPENSABLE INDUSTRY STANDARD for doing any kind of XML-related programming...

OTOH, if you're using any type of macosx/linux/unix system, you can install expat yourself in your own home or working directory, and install XML::whatever perl modules as well, just making sure that you specify your personal path for expat when installing the modules, and include the module path in @INC in your scripts (e.g. using "-I/path/to/your/modules" on the shebang line).

As for rolling your own work-around for well-formedness testing... I gather you've been making some changes to the original code from O'Reilly, but maybe you need to make different kinds of changes -- like adding some sort of "warning" output at each of the points where the "is_well_formed" function returns prematurely, so that you'll get some explanation of why a given XML text failed.

It turns out that the OP code has a problem here:

        # match character data
        } elsif( $text =~ /(^[^&&>]+)/ ) {
print "char data";
            my $data = $1;
            # make sure the data is inside an element
            return if( $data =~ /\S/ and not( @elements ));
            $text = $';
[download]

Notice that the "return if(..." statement does a regex match, looking for any non-whitespace character. This causes the "$'" variable to be reset to the string that follows that match. So at the first occurrence of text data inside an element ("self" in this case), $text gets set to whatever follows the first non-whitespace character in $data ("elf" in this case). From that point on, of course, it's a failure.

(update: Forgot to mention: there was also a problem with a lot of the regexes, like the first one in the above snippet, where there was an ampersand when there should have been an open angle bracket; e.g. the one above should have been /(^[^<&>]+)/ )

Apart from that, though, it looks okay. Here's a fixed version, complete with informative warning messages at each of the return points that result from badly formed data, and a final print-out of the end result (okay or bad). To generalize it further, you would replace the string arg in the initial subroutine call with some variable whose content was slurped from STDIN or a file (and you might want to get rid of the initial print statement that I put into the sub).

$rc=is_well_formed ("<memo>
  <to>self</to>
  <message>Don't forget to mow the car and wash the
  lawn.</message>
</memo>");
$report = ( $rc ) ? "okay" : "bad";
print "$report\n";

sub is_well_formed {
    my $text = shift;                     # XML text to check

    print "text to check:\n$text\n";
    # match patterns
    my $ident = '[:_A-Za-z][:A-Za-z0-9\._-]*';   # identifier
    my $optsp = '\s*';                            # optional space
    my $att1 = "$ident$optsp=$optsp\"[^\"]*\"";   # attribute
    my $att2 = "$ident$optsp=$optsp'[^']*'";      # attr. variant
    my $att = "($att1|$att2)";                    # any attribute

    my @elements = ( );                    # stack of open elems

    # loop through the string to pull out XML markup objects
    while( length($text) ) {
        # match an empty element
        if( $text =~ /^<($ident)(\s+$att)*\s*\/>/ ) {
            $text = $';

        # match an element start tag
        } elsif( $text =~ /^<($ident)(\s+$att)*\s*>/ ) {
            push( @elements, $1 );
            $text = $';

        # match an element end tag
        } elsif( $text =~ /^<\/($ident)\s*>/ ) {
            if ( $1 ne pop @elements ) {
                warn "close tag $1 lacks open tag\n";
                return;
            }
            $text = $';

        # match a comment
        } elsif( $text =~ /^<!--/ ) {
            $text = $';
            # bite off the rest of the comment
            if( $text =~ /-->/ ) {
                $text = $';
                if( $` =~ /--/ ) {
                    warn "comment can't contain '--'\n";
                    return;
                }
            } else {
                warn "no end-comment fount after start-comment\n";
                return;
            }

        
        # match extra whitespace
        # (in case there is space outside the root element)
        } elsif( $text =~ m|^\s+| ) {
            $text = $';

        # match character data
        } elsif( $text =~ /(^[^<&>]+)/ ) {
            my $data = $1;
            $text = $';
            # make sure the data is inside an element
            if( $data =~ /\S/ and not( @elements )) {
                warn "data found outside root element\n";
                return;
            }
            
        # match entity reference
        } elsif( $text =~ /^&$ident;+/ ) {
            $text = $';
         
        # something unexpected
        } else {
            warn "something unexpected\n";
            return;
        }
    }
    if( @elements ) {
        warn "elements not closed at end of stream: @elements\n";
        return;
    }
    return 1;
}
[download]

In reply to Re: Need help in xml well formedness checker. by graff
in thread Need help in xml well formedness checker. by jayakumark

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.