Let's presume you have a well-formed document, and ignore the question of whether it's valid against a particular XSD.
The first mistake you are making is treating an XML document's line structure as significant. While newlines and indentation are considered good form in an XML document, the standard attaches no structural meaning to line breaks, so a tag or its content can span lines freely. Thus, you should slurp the whole file into a single variable. Something like:
    #!/usr/local/bin/perl
    use strict;
    use warnings;

    my $sandboxxml = do {
        open(my $fh, '<', $ARGV[0])
            || die("sandbox xml file cannot be loaded; check for file name or existence");
        local $/;   # Slurp
        <$fh>;
    };

Note that by having an indirect file handle in the do where I localize $/, the file is automatically closed once I'm done with it.
Second, comments can contain all sorts of text that might interfere with a parse. As well, an XML document may contain a CDATA block, which can contain very nearly arbitrary text. I'm assuming that you don't have them in your trial document since you never handle them, but they are possible and must be removed before you can handle anything else. This also introduces the need to tokenize: you must extract something from your document, but keep a placeholder in there so you know where your content came from. Since who knows what's in the document, we'll need to pick something that can't possibly be legal XML, but that we can work around in our regular expression. How about <<#>>, where # is the index in our token array? Note that since comment delimiters are not special within a CDATA block and vice versa, we must strip them simultaneously. So:
    my @tokens;
    while ($sandboxxml =~ /<!\[(CDATA)\[|<!--/) {
        if ($1) {   # We're in a CDATA block
            $sandboxxml =~ s/<!\[CDATA\[(.*?)\]\]>/'<<' . (0+@tokens) . '>>'/es;
            push @tokens, $1;
        } else {    # Comment
            $sandboxxml =~ s/<!--.*?-->//s;
        }
    }

Note that we're just dropping comments; that if the file isn't well-formed (say, an unterminated comment or CDATA block), we just created an infinite loop; and that there's lots of lovely escaping, since [ and ] have special meaning in regular expressions.
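As a minimal sketch of the same step with the infinite loop guarded against, the loop can be wrapped in a subroutine that dies when a substitution fails to make progress. The name strip_cdata_and_comments and the sample string are made up for illustration; they are not from the original code.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): moves CDATA content into
# @$tokens, drops comments, and dies rather than looping forever when a
# CDATA section or comment is never terminated.
sub strip_cdata_and_comments {
    my ($xml, $tokens) = @_;
    while ($xml =~ /<!\[(CDATA)\[|<!--/) {
        my $changed;
        if ($1) {
            # A CDATA section opens first: swap it for a <<N>> placeholder
            my $index = @$tokens;
            $changed = $xml =~ s/<!\[CDATA\[(.*?)\]\]>/<<$index>>/s;
            push @$tokens, $1 if $changed;
        }
        else {
            # A comment opens first: drop it
            $changed = $xml =~ s/<!--.*?-->//s;
        }
        die "unterminated CDATA section or comment\n" unless $changed;
    }
    return $xml;
}

# Illustrative input: a comment that mentions CDATA, and a CDATA
# section that contains comment delimiters.
my @tokens;
my $sample = '<a><!-- <![CDATA[ not real ]]> --><b><![CDATA[x < y <!-- kept -->]]></b></a>';
my $stripped = strip_cdata_and_comments($sample, \@tokens);

print "$stripped\n";     # prints <a><b><<0>></b></a>
print "$tokens[0]\n";    # prints x < y <!-- kept -->
```

Whichever delimiter opens first wins, which is why the comment that merely mentions CDATA is dropped whole while the comment-like text inside the real CDATA block survives in the token.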
Okay, now we can start actually dealing with tags. Because of how XML is structured, we need to work from the inside out; otherwise it is very hard in a general regex to know whether you've actually matched start and end tags. We also now need to keep track of a tree structure in some way, but fortunately we can do that in a soft way using the tokens array we've already started.
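    # Collapse the innermost element into a <<N>> placeholder, repeatedly.
    # The (?<!/) lookbehind keeps a self-closing tag that is directly
    # followed by a closing tag, e.g. <b/></a>, from matching as a start tag.
    while ($sandboxxml =~ s#(<[^<>]*(?:/|(?<!/)>(?:[^<>]|<<\d+>>)*</[^<>]*)>)#'<<' . (0+@tokens) . '>>'#es) {
        push @tokens, $1;
    }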
Of course, that's a giant mess. We also haven't built our tree up yet, and we failed to handle the leading <?xml...> tag. And a hundred other things. And if our expressions are that complex, debugging them is going to be a pain.
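To see how the placeholders already encode the tree in a soft way, here is a rough sketch that reduces a tiny document and then walks the tokens back out. The subroutine names reduce_elements and outline and the sample document are assumptions for illustration only. The pattern follows the one above, with a (?<!/) lookbehind so that a self-closing tag directly followed by a closing tag is not mistaken for a start tag. Start and end tag names are still not checked against each other.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse innermost elements into <<N>> placeholders
# until nothing more matches. Each token holds one element whose children
# have already been replaced by their own placeholders.
sub reduce_elements {
    my ($xml, $tokens) = @_;
    while ($xml =~ s#(<[^<>]*(?:/|(?<!/)>(?:[^<>]|<<\d+>>)*</[^<>]*)>)#'<<' . (0+@$tokens) . '>>'#es) {
        push @$tokens, $1;
    }
    return $xml;
}

# Hypothetical helper: follow the placeholders in $text and return one
# indented line per token, children nested under their parent.
sub outline {
    my ($text, $tokens, $depth) = @_;
    my @lines;
    while ($text =~ /<<(\d+)>>/g) {
        my $token = $tokens->[$1];
        push @lines, ('  ' x $depth) . $token;
        push @lines, outline($token, $tokens, $depth + 1);
    }
    return @lines;
}

my @tokens;
my $top = reduce_elements('<root><item id="1">one</item><empty/></root>', \@tokens);

print "$top\n";    # prints <<2>>
print "$_\n" for outline($top, \@tokens, 0);
# prints:
# <root><<0>><<1>></root>
#   <item id="1">one</item>
#   <empty/>
```

The tokens array is a flat list, but because every parent token refers to its children by index, a recursive walk like outline recovers the nesting without any separate tree structure.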
#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
In reply to Re: pattern match screwed up!!
by kennethk
in thread pattern match screwed up!!
by mdfaizy