comment on

That /^&/ in all your regexes looks wrong to me. I'm quite certain you ment to match a "<" (almost) every time.

And I'd stay away from the $' variable and friends. Use @+, more particular, $+[0], to get the end of the match, and next, remove it with substr. Or, you could just use s/^$pattern// instead.

p.s. Don't let the other monks get to you, they're trained to yell "use a parser!" as soon as they see a somewhat complicated parsing job. I think you have a correct approach, one that, once debugged, could give correct results. (Caveat: I haven't inspected every detail) After all, the wellformedness check for HTML in user posts on this site, yes, Perlmonks, is somewhat similar in construct.

Update Oh, I see you got it off a website. That's where it got mangled. And you got it off O'Reilly. No wonder it's pretty decent.

Anyway, I've updated the script, fixing the bugs I mentioned, and modernizing it so it uses qr// instead of bare strings and s/^blah// instead of depending on $'. I also added the /o option to the regexes so they get compiled only once.

And guess what: it works.

update: Apparently it can't handle numerical entities, only named entities. I've added support for the former.

#! perl -w
$rc=is_well_formed ("<memo>
  <to>self</to>
  <message>Don't forget to mow the car and wash the
  lawn.</message>
</memo>");
print $rc ? 'success' : 'failure';
sub is_well_formed {
    local $\ = "\n";
    my $text = shift;                     # XML text to check

    # match patterns
    my $ident = qr/[:_A-Za-z][:A-Za-z0-9\-\._]*/;   # identifier
    my $optsp = qr/\s*/;                            # optional whitesp
+ace
    my $att1 = qr/$ident$optsp=$optsp"[^"]*"/;      # attribute
    my $att2 = qr/$ident$optsp=$optsp'[^']*'/;      # attr. variant
    my $att = qr/$att1|$att2/;                      # any attribute

    my @elements = ( );                    # stack of open elems
print "Identifier $ident";
print "optsp $optsp";
print "att $att";

    # loop through the string to pull out XML markup objects
    while( length($text) ) {
print "Inside Loop";
        # match an empty element
        if( $text =~ s/^<($ident)(\s+$att)*\s*\/>//o ) {

        # match an element start tag
        } elsif( $text =~ s/^<($ident)(\s+$att)*\s*>//o ) {
            push( @elements, $1 );

        # match an element end tag
        } elsif( $text =~ s/^<\/($ident)\s*>//o ) {
            return unless( $1 eq pop( @elements ));

        # match a comment
        } elsif( $text =~ s/^<!--//o ) {
            # bite off the rest of the comment
            if( $text =~ s/^.*?--(>?)//os ) {
                return if(!$1);  # comments can't contain '--'
            } else {
                return;
            }


        # match extra whitespace
        # (in case there is space outside the root element)
        } elsif( $text =~ s/^\s+//o ) {

        # match character data
        } elsif( $text =~ s/^([^<&>]+)//o ) {
print "char data";
            my $data = $1;
            # make sure the data is inside an element
            return if( $data =~ /\S/ and not( @elements ));

        # match entity reference
        } elsif( $text =~ s/^&$ident;//o ) {

        # match numerical entity reference
        # added by bart
        } elsif( $text =~ s/^&#(?:\d+|x[0-9a-f]+);//io ) {

        # something unexpected
        } else {
            return;
        }
    }
    return if( @elements );     # the stack should be empty
    return 1;

}
[download]

Update 20070404: tye remarked that the ";" at the end of entities shouldn't be repeatable. Fixed (dropped the "+")

Result:

Identifier (?-xism:[:_A-Za-z][:A-Za-z0-9\-\._]*)
optsp (?-xism:\s*)
att (?-xism:(?-xism:(?-xism:[:_A-Za-z][:A-Za-z0-9\-\._]*)(?-xism:\s*)=
+(?-xism:\s*)"[^"]*")|(?-xism:(?-xism:[:_A-Za-z][:A-Za-z0-9\-\._]*)(?-
+xism:\s*)=(?-xism:\s*)'[^']*'))
Inside Loop
Inside Loop
Inside Loop
Inside Loop
char data
Inside Loop
Inside Loop
Inside Loop
Inside Loop
char data
Inside Loop
Inside Loop
Inside Loop
success
[download]

In reply to Re: Need help in xml well formedness checker. by bart
in thread Need help in xml well formedness checker. by jayakumark

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.