
So, I will open by saying Anonymous Monk is right and you probably shouldn't be rolling your own here. You are highly unlikely to win the cost-benefit analysis with a homegrown solution. I do think there is educational value in understanding how to do it, but this is like crafting your own object system: go ahead and roll your own to understand the principles, and then use a well-tested one in production to CYA.

Let's presume you have a well-formed document, and ignore the question as to whether it's valid for a particular XSD.

The first mistake you are making is treating an XML document's line structure as significant. While newlines and indentation are considered good form in an XML document, the standard gives line breaks no structural meaning: an element may start on one line and end on another. Thus, you should slurp the whole file into a single variable. Something like:

#!/usr/local/bin/perl
use strict;
use warnings;

my $sandboxxml = do {
    open(my $fh, '<', $ARGV[0]) || die("sandbox xml file cannot be loaded; check the file name or existence: $!");
    local $/; # Slurp
    <$fh>;
};

Note that because the lexical filehandle is declared inside the do block where I localize $/, the file is automatically closed once the block ends and I'm done with it.
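To see why slurping matters, here is a minimal sketch, assuming a made-up sample document whose item element spans several lines. The $sample string and the counter names are my own inventions for illustration, and the string is opened as an in-memory file so nothing on disk is needed.

```perl
use strict;
use warnings;

# Hypothetical sample document; the <item> element spans several lines.
my $sample = "<list>\n  <item\n     id=\"1\">first\n  </item>\n</list>\n";

# Open the string as an in-memory file so the sketch needs no file on disk.
my $slurped = do {
    open(my $fh, '<', \$sample) or die "cannot open in-memory file: $!";
    local $/;    # undef record separator: one read returns everything
    <$fh>;
};

# Reading line by line never sees the whole element on a single line.
open(my $fh, '<', \$sample) or die "cannot open in-memory file: $!";
my $hits_by_line = grep { /<item\b.*<\/item>/ } <$fh>;
close $fh;

# Against the slurped text, /s lets . cross the newlines.
my $hits_slurped = () = $slurped =~ /<item\b.*?<\/item>/gs;

print "line by line: $hits_by_line, slurped: $hits_slurped\n";
```

The line-by-line pass finds nothing, while the slurped pass finds the element, which is the whole argument for reading the file in one go.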

Second, comments can contain all sorts of text that might interfere with a parse. An XML document may also contain CDATA sections, which can hold very nearly arbitrary text. I'm assuming that you don't have them in your trial document since you never handle them, but they are possible and must be removed before you can handle anything else. This also introduces the need to tokenize: you must extract something from your document but leave a placeholder behind so you know where the content came from. Since who knows what's in the document, we'll need to pick a placeholder that can't possibly be legal XML but that we can work around in our regular expressions. How about <<#>>, where # is the index in our token array? Note that since comment delimiters are not special within a CDATA section and vice versa, we must strip them in the same pass. So:

my @tokens;

while ($sandboxxml =~ /<!\[(CDATA)\[|<!--/) {
    if ($1) { # We're in a CDATA block
        $sandboxxml =~ s/<!\[CDATA\[(.*?)\]\]>/'<<' . (0+@tokens) . '>>'/es;
        push @tokens, $1;
    } else { # Comment
        $sandboxxml =~ s/<!--.*?-->//s;
    }
}

Note three things: we're just dropping comments; if the file isn't well-formed (say, it has an unterminated comment), we've just created an infinite loop; and there's lots of lovely escaping, since [ and ] have special meaning in regular expressions.
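As a rough sketch of the same pass wrapped in a sub, the following dies instead of spinning forever when a section is never closed. The name strip_cdata_and_comments and the sample string are hypothetical, chosen only for this illustration. The sample puts a CDATA opener inside a comment and a comment inside a CDATA section to show why the two have to be handled in one loop.

```perl
use strict;
use warnings;

# Hypothetical helper: replace CDATA sections with <<N>> placeholders and
# drop comments, dying rather than looping on an unterminated section.
sub strip_cdata_and_comments {
    my ($xml) = @_;
    my @tokens;
    while ($xml =~ /<!\[(CDATA)\[|<!--/) {
        if ($1) {
            # The leftmost opener is a CDATA section: keep its body as a token.
            $xml =~ s/<!\[CDATA\[(.*?)\]\]>/'<<' . (0+@tokens) . '>>'/es
                or die "Unterminated CDATA section\n";
            push @tokens, $1;
        } else {
            # The leftmost opener is a comment: drop it entirely.
            $xml =~ s/<!--.*?-->//s
                or die "Unterminated comment\n";
        }
    }
    return ($xml, \@tokens);
}

# Made-up input: a CDATA opener inside a comment, a comment inside CDATA.
my $doc = '<a><!-- <![CDATA[ in a comment --><b><![CDATA[x <!-- y --> z]]></b></a>';
my ($stripped, $tokens) = strip_cdata_and_comments($doc);

print "$stripped\n";       # <a><b><<0>></b></a>
print "$tokens->[0]\n";    # x <!-- y --> z
```

The comment is removed whole even though it contains a CDATA opener, and the CDATA body survives intact as token 0 even though it contains a comment.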

Okay, now we can start actually dealing with tags. Because of how XML is structured, we need to work from the inside out; otherwise it is very hard in a general regex to know if you've actually matched start and end tags. We also now need to keep track of a tree structure in some way, but fortunately we can do that in a soft way using the tokens array we've already started.

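# Collapse innermost elements first. The (?<!/) lookbehind keeps a
# self-closing tag such as <c/> from being taken as an opening tag.
while ($sandboxxml =~ s#(<[^<>]*(?:/|(?<!/)>(?:[^<>]|<<\d*>>)*</[^<>]*)>)#'<<' . (0+@tokens) . '>>'#es) {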
    push @tokens, $1;
}

Of course, that's a giant mess. We also haven't built our tree yet, and we've failed to handle the leading <?xml...> declaration. And a hundred other things. And if our expressions are that complex, debugging them is going to be a pain.
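For the curious, here is one way the tokens array could be turned back into a tree. This is a sketch under heavy assumptions, not a parser. The helpers build_node and outline are hypothetical names. It ignores attributes and text, and it assumes every <<N>> placeholder points at an element rather than a CDATA token. The pattern here includes a (?<!/) lookbehind so that a self-closing tag is never treated as an opening tag.

```perl
use strict;
use warnings;

my $xml = '<a><b>x</b><c/></a>';    # made-up sample document
my @tokens;

# Inside-out reduction: innermost elements become <<N>> placeholders first.
while ($xml =~ s#(<[^<>]*(?:/|(?<!/)>(?:[^<>]|<<\d*>>)*</[^<>]*)>)#'<<' . (0+@tokens) . '>>'#es) {
    push @tokens, $1;
}

# Hypothetical helper: turn token N into { name, children }, following
# <<N>> placeholders into child tokens. Attributes and text are ignored.
sub build_node {
    my ($index) = @_;
    my $token  = $tokens[$index];
    my ($name) = $token =~ m#^<([^\s/<>]+)#;
    my ($body) = $token =~ m#^<[^<>]*>(.*)</[^<>]*>$#s;
    $body = '' unless defined $body;    # self-closing tags have no body
    my @children = map { build_node($_) } $body =~ /<<(\d+)>>/g;
    return { name => $name, children => \@children };
}

# Hypothetical helper: render the tree as an indented outline.
sub outline {
    my ($node, $depth) = @_;
    my $line = ('  ' x $depth) . $node->{name} . "\n";
    return join '', $line, map { outline($_, $depth + 1) } @{ $node->{children} };
}

my $tree = build_node($#tokens);    # the last token is the root element
print outline($tree, 0);
```

For this sample the outline is a, with b and c indented beneath it. Even this toy needs two more regexes and a recursion, which rather proves the point about using a tested parser.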

#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

In reply to Re: pattern match screwed up!! by kennethk
in thread pattern match screwed up!! by mdfaizy
