in reply to Need help in xml well formedness checker.

If you are using a windows system where you don't have admin privileges to install compiled code, that's a shame -- I really would expect the sysadmins to recognize the value of making the expat package available to the software development staff, seeing as it's sort of AN INDISPENSABLE INDUSTRY STANDARD for doing any kind of XML-related programming...

OTOH, if you're using any type of macosx/linux/unix system, you can install expat yourself in your own home or working directory, and install XML::whatever perl modules as well, just making sure that you specify your personal path for expat when installing the modules, and include the module path in @INC in your scripts (e.g. using "-I/path/to/your/modules" on the shebang line).

As for rolling your own work-around for well-formedness testing... I gather you've been making some changes to the original code from O'Reilly, but maybe you need to make different kinds of changes -- like adding some sort of "warning" output at each of the points where the "is_well_formed" function returns prematurely, so that you'll get some explanation of why a given XML text failed.

It turns out that the OP code has a problem here:

# match character data } elsif( $text =~ /(^[^&&>]+)/ ) { print "char data"; my $data = $1; # make sure the data is inside an element return if( $data =~ /\S/ and not( @elements )); $text = $';
Notice that the "return if(..." statement does a regex match, looking for any non-whitespace character. This causes the "$'" variable to be reset to the string that follows that match. So at the first occurrence of text data inside an element ("self" in this case), $text gets set to whatever follows the first non-whitespace character in $data ("elf" in this case). From that point on, of course, it's a failure.

(update: Forgot to mention: there was also a problem with a lot of the regexes, like the first one in the above snippet, where there was an ampersand when there should have been an open angle bracket; e.g. the one above should have been /(^[^<&>]+)/ )

Apart from that, though, it looks okay. Here's a fixed version, complete with informative warning messages at each of the return points that result from badly formed data, and a final print-out of the end result (okay or bad). To generalize it further, you would replace the string arg in the initial subroutine call with some variable whose content was slurped from STDIN or a file (and you might want to get rid of the initial print statement that I put into the sub).

$rc=is_well_formed ("<memo> <to>self</to> <message>Don't forget to mow the car and wash the lawn.</message> </memo>"); $report = ( $rc ) ? "okay" : "bad"; print "$report\n"; sub is_well_formed { my $text = shift; # XML text to check print "text to check:\n$text\n"; # match patterns my $ident = '[:_A-Za-z][:A-Za-z0-9\._-]*'; # identifier my $optsp = '\s*'; # optional space my $att1 = "$ident$optsp=$optsp\"[^\"]*\""; # attribute my $att2 = "$ident$optsp=$optsp'[^']*'"; # attr. variant my $att = "($att1|$att2)"; # any attribute my @elements = ( ); # stack of open elems # loop through the string to pull out XML markup objects while( length($text) ) { # match an empty element if( $text =~ /^<($ident)(\s+$att)*\s*\/>/ ) { $text = $'; # match an element start tag } elsif( $text =~ /^<($ident)(\s+$att)*\s*>/ ) { push( @elements, $1 ); $text = $'; # match an element end tag } elsif( $text =~ /^<\/($ident)\s*>/ ) { if ( $1 ne pop @elements ) { warn "close tag $1 lacks open tag\n"; return; } $text = $'; # match a comment } elsif( $text =~ /^<!--/ ) { $text = $'; # bite off the rest of the comment if( $text =~ /-->/ ) { $text = $'; if( $` =~ /--/ ) { warn "comment can't contain '--'\n"; return; } } else { warn "no end-comment fount after start-comment\n"; return; } # match extra whitespace # (in case there is space outside the root element) } elsif( $text =~ m|^\s+| ) { $text = $'; # match character data } elsif( $text =~ /(^[^<&>]+)/ ) { my $data = $1; $text = $'; # make sure the data is inside an element if( $data =~ /\S/ and not( @elements )) { warn "data found outside root element\n"; return; } # match entity reference } elsif( $text =~ /^&$ident;+/ ) { $text = $'; # something unexpected } else { warn "something unexpected\n"; return; } } if( @elements ) { warn "elements not closed at end of stream: @elements\n"; return; } return 1; }

Replies are listed 'Best First'.
Re^2: Need help in xml well formedness checker.
by jayakumark (Acolyte) on Feb 21, 2007 at 10:53 UTC
    Thank you very much graff, It works...