in reply to Need help in xml well formedness checker.
And I'd stay away from the $' variable and friends. Use @+, more particular, $+[0], to get the end of the match, and next, remove it with substr. Or, you could just use s/^$pattern// instead.
p.s. Don't let the other monks get to you, they're trained to yell "use a parser!" as soon as they see a somewhat complicated parsing job. I think you have a correct approach, one that, once debugged, could give correct results. (Caveat: I haven't inspected every detail) After all, the wellformedness check for HTML in user posts on this site, yes, Perlmonks, is somewhat similar in construct.
Update Oh, I see you got it off a website. That's where it got mangled. And you got it off O'Reilly. No wonder it's pretty decent.
Anyway, I've updated the script, fixing the bugs I mentioned, and modernizing it so it uses qr// instead of bare strings and s/^blah// instead of depending on $'. I also added the /o option to the regexes so they get compiled only once.
And guess what: it works.
update: Apparently it can't handle numerical entities, only named entities. I've added support for the former.
#! perl -w $rc=is_well_formed ("<memo> <to>self</to> <message>Don't forget to mow the car and wash the lawn.</message> </memo>"); print $rc ? 'success' : 'failure'; sub is_well_formed { local $\ = "\n"; my $text = shift; # XML text to check # match patterns my $ident = qr/[:_A-Za-z][:A-Za-z0-9\-\._]*/; # identifier my $optsp = qr/\s*/; # optional whitesp +ace my $att1 = qr/$ident$optsp=$optsp"[^"]*"/; # attribute my $att2 = qr/$ident$optsp=$optsp'[^']*'/; # attr. variant my $att = qr/$att1|$att2/; # any attribute my @elements = ( ); # stack of open elems print "Identifier $ident"; print "optsp $optsp"; print "att $att"; # loop through the string to pull out XML markup objects while( length($text) ) { print "Inside Loop"; # match an empty element if( $text =~ s/^<($ident)(\s+$att)*\s*\/>//o ) { # match an element start tag } elsif( $text =~ s/^<($ident)(\s+$att)*\s*>//o ) { push( @elements, $1 ); # match an element end tag } elsif( $text =~ s/^<\/($ident)\s*>//o ) { return unless( $1 eq pop( @elements )); # match a comment } elsif( $text =~ s/^<!--//o ) { # bite off the rest of the comment if( $text =~ s/^.*?--(>?)//os ) { return if(!$1); # comments can't contain '--' } else { return; } # match extra whitespace # (in case there is space outside the root element) } elsif( $text =~ s/^\s+//o ) { # match character data } elsif( $text =~ s/^([^<&>]+)//o ) { print "char data"; my $data = $1; # make sure the data is inside an element return if( $data =~ /\S/ and not( @elements )); # match entity reference } elsif( $text =~ s/^&$ident;//o ) { # match numerical entity reference # added by bart } elsif( $text =~ s/^&#(?:\d+|x[0-9a-f]+);//io ) { # something unexpected } else { return; } } return if( @elements ); # the stack should be empty return 1; }
Update 20070404: tye remarked that the ";" at the end of entities shouldn't be repeatable. Fixed (dropped the "+")
Result:
Identifier (?-xism:[:_A-Za-z][:A-Za-z0-9\-\._]*) optsp (?-xism:\s*) att (?-xism:(?-xism:(?-xism:[:_A-Za-z][:A-Za-z0-9\-\._]*)(?-xism:\s*)= +(?-xism:\s*)"[^"]*")|(?-xism:(?-xism:[:_A-Za-z][:A-Za-z0-9\-\._]*)(?- +xism:\s*)=(?-xism:\s*)'[^']*')) Inside Loop Inside Loop Inside Loop Inside Loop char data Inside Loop Inside Loop Inside Loop Inside Loop char data Inside Loop Inside Loop Inside Loop success
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re^2: Need help in xml well formedness checker.
by jayakumark (Acolyte) on Feb 22, 2007 at 11:43 UTC |