That /^&/ in all your regexes looks wrong to me. I'm quite certain you ment to match a "<" (almost) every time.

And I'd stay away from the $' variable and friends. Use @+, more particular, $+[0], to get the end of the match, and next, remove it with substr. Or, you could just use s/^$pattern// instead.

p.s. Don't let the other monks get to you, they're trained to yell "use a parser!" as soon as they see a somewhat complicated parsing job. I think you have a correct approach, one that, once debugged, could give correct results. (Caveat: I haven't inspected every detail) After all, the wellformedness check for HTML in user posts on this site, yes, Perlmonks, is somewhat similar in construct.

Update Oh, I see you got it off a website. That's where it got mangled. And you got it off O'Reilly. No wonder it's pretty decent.

Anyway, I've updated the script, fixing the bugs I mentioned, and modernizing it so it uses qr// instead of bare strings and s/^blah// instead of depending on $'. I also added the /o option to the regexes so they get compiled only once.

And guess what: it works.

update: Apparently it can't handle numerical entities, only named entities. I've added support for the former.

#! perl -w $rc=is_well_formed ("<memo> <to>self</to> <message>Don't forget to mow the car and wash the lawn.</message> </memo>"); print $rc ? 'success' : 'failure'; sub is_well_formed { local $\ = "\n"; my $text = shift; # XML text to check # match patterns my $ident = qr/[:_A-Za-z][:A-Za-z0-9\-\._]*/; # identifier my $optsp = qr/\s*/; # optional whitesp +ace my $att1 = qr/$ident$optsp=$optsp"[^"]*"/; # attribute my $att2 = qr/$ident$optsp=$optsp'[^']*'/; # attr. variant my $att = qr/$att1|$att2/; # any attribute my @elements = ( ); # stack of open elems print "Identifier $ident"; print "optsp $optsp"; print "att $att"; # loop through the string to pull out XML markup objects while( length($text) ) { print "Inside Loop"; # match an empty element if( $text =~ s/^<($ident)(\s+$att)*\s*\/>//o ) { # match an element start tag } elsif( $text =~ s/^<($ident)(\s+$att)*\s*>//o ) { push( @elements, $1 ); # match an element end tag } elsif( $text =~ s/^<\/($ident)\s*>//o ) { return unless( $1 eq pop( @elements )); # match a comment } elsif( $text =~ s/^<!--//o ) { # bite off the rest of the comment if( $text =~ s/^.*?--(>?)//os ) { return if(!$1); # comments can't contain '--' } else { return; } # match extra whitespace # (in case there is space outside the root element) } elsif( $text =~ s/^\s+//o ) { # match character data } elsif( $text =~ s/^([^<&>]+)//o ) { print "char data"; my $data = $1; # make sure the data is inside an element return if( $data =~ /\S/ and not( @elements )); # match entity reference } elsif( $text =~ s/^&$ident;//o ) { # match numerical entity reference # added by bart } elsif( $text =~ s/^&#(?:\d+|x[0-9a-f]+);//io ) { # something unexpected } else { return; } } return if( @elements ); # the stack should be empty return 1; }

Update 20070404: tye remarked that the ";" at the end of entities shouldn't be repeatable. Fixed (dropped the "+")

Result:

Identifier (?-xism:[:_A-Za-z][:A-Za-z0-9\-\._]*) optsp (?-xism:\s*) att (?-xism:(?-xism:(?-xism:[:_A-Za-z][:A-Za-z0-9\-\._]*)(?-xism:\s*)= +(?-xism:\s*)"[^"]*")|(?-xism:(?-xism:[:_A-Za-z][:A-Za-z0-9\-\._]*)(?- +xism:\s*)=(?-xism:\s*)'[^']*')) Inside Loop Inside Loop Inside Loop Inside Loop char data Inside Loop Inside Loop Inside Loop Inside Loop char data Inside Loop Inside Loop Inside Loop success

In reply to Re: Need help in xml well formedness checker. by bart
in thread Need help in xml well formedness checker. by jayakumark

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.