in reply to Need help in xml well formedness checker.

That /^&/ in all your regexes looks wrong to me. I'm quite certain you ment to match a "<" (almost) every time.

And I'd stay away from the $' variable and friends. Use @+, more particular, $+[0], to get the end of the match, and next, remove it with substr. Or, you could just use s/^$pattern// instead.

p.s. Don't let the other monks get to you, they're trained to yell "use a parser!" as soon as they see a somewhat complicated parsing job. I think you have a correct approach, one that, once debugged, could give correct results. (Caveat: I haven't inspected every detail) After all, the wellformedness check for HTML in user posts on this site, yes, Perlmonks, is somewhat similar in construct.

Update Oh, I see you got it off a website. That's where it got mangled. And you got it off O'Reilly. No wonder it's pretty decent.

Anyway, I've updated the script, fixing the bugs I mentioned, and modernizing it so it uses qr// instead of bare strings and s/^blah// instead of depending on $'. I also added the /o option to the regexes so they get compiled only once.

And guess what: it works.

update: Apparently it can't handle numerical entities, only named entities. I've added support for the former.

#! perl -w $rc=is_well_formed ("<memo> <to>self</to> <message>Don't forget to mow the car and wash the lawn.</message> </memo>"); print $rc ? 'success' : 'failure'; sub is_well_formed { local $\ = "\n"; my $text = shift; # XML text to check # match patterns my $ident = qr/[:_A-Za-z][:A-Za-z0-9\-\._]*/; # identifier my $optsp = qr/\s*/; # optional whitesp +ace my $att1 = qr/$ident$optsp=$optsp"[^"]*"/; # attribute my $att2 = qr/$ident$optsp=$optsp'[^']*'/; # attr. variant my $att = qr/$att1|$att2/; # any attribute my @elements = ( ); # stack of open elems print "Identifier $ident"; print "optsp $optsp"; print "att $att"; # loop through the string to pull out XML markup objects while( length($text) ) { print "Inside Loop"; # match an empty element if( $text =~ s/^<($ident)(\s+$att)*\s*\/>//o ) { # match an element start tag } elsif( $text =~ s/^<($ident)(\s+$att)*\s*>//o ) { push( @elements, $1 ); # match an element end tag } elsif( $text =~ s/^<\/($ident)\s*>//o ) { return unless( $1 eq pop( @elements )); # match a comment } elsif( $text =~ s/^<!--//o ) { # bite off the rest of the comment if( $text =~ s/^.*?--(>?)//os ) { return if(!$1); # comments can't contain '--' } else { return; } # match extra whitespace # (in case there is space outside the root element) } elsif( $text =~ s/^\s+//o ) { # match character data } elsif( $text =~ s/^([^<&>]+)//o ) { print "char data"; my $data = $1; # make sure the data is inside an element return if( $data =~ /\S/ and not( @elements )); # match entity reference } elsif( $text =~ s/^&$ident;//o ) { # match numerical entity reference # added by bart } elsif( $text =~ s/^&#(?:\d+|x[0-9a-f]+);//io ) { # something unexpected } else { return; } } return if( @elements ); # the stack should be empty return 1; }

Update 20070404: tye remarked that the ";" at the end of entities shouldn't be repeatable. Fixed (dropped the "+")

Result:

Identifier (?-xism:[:_A-Za-z][:A-Za-z0-9\-\._]*) optsp (?-xism:\s*) att (?-xism:(?-xism:(?-xism:[:_A-Za-z][:A-Za-z0-9\-\._]*)(?-xism:\s*)= +(?-xism:\s*)"[^"]*")|(?-xism:(?-xism:[:_A-Za-z][:A-Za-z0-9\-\._]*)(?- +xism:\s*)=(?-xism:\s*)'[^']*')) Inside Loop Inside Loop Inside Loop Inside Loop char data Inside Loop Inside Loop Inside Loop Inside Loop char data Inside Loop Inside Loop Inside Loop success

Replies are listed 'Best First'.
Re^2: Need help in xml well formedness checker.
by jayakumark (Acolyte) on Feb 22, 2007 at 11:43 UTC
    Thanks bart.I have updated my script with this code.It works great.