marvell has asked for the wisdom of the Perl Monks concerning the following question:

I promise, I've already been through the Super Search route, but I have to wonder about the validity and up-to-dateness of nodes in the present wave of XML fever.

In a nutshell, all I want to do is check if some XML is well formed. I don't want to know anything about it, just see if it's well formed.

The background is that I have 20,000 hand-written HTML files from which I have stripped the "useful" data. This comes in the form of a snippet of HTML. Now the client has come back informing me that they want it in XML, or at least a snippet of well-formed XML.

XML::Parser with no handlers seemed a good plan, but then, it can only be used once per instance and croaks if the XML is not well formed.

OK, so I can wrap it up in eval, but then it all seems a bit bloated. But then, I'm not really in a position to comment on the performance overhead.

Another plan was to preconvert the HTML to XHTML, but that looks to take ages.

Your wisdom would be appreciated.

--
Brother Marvell

Replies are listed 'Best First'.
Re: well formed xml
by davorg (Chancellor) on Feb 28, 2001 at 20:34 UTC

    In this case I think that an XML::Parser call wrapped in an eval is your best bet. Something like this:

        foreach (@list_of_20_000_files) {
            my $p = XML::Parser->new;
            eval { $p->parsefile($_) };
            if ($@) {
                print "$_ is bad\n";
            }
            else {
                print "$_ is good\n";
            }
        }

    The one thing that worries me tho' is that you talk about 'snippets' of well-formed XML. If those snippets don't have one enclosing element, then they won't be well-formed.

    --
    <http://www.dave.org.uk>

    "Perl makes the fun jobs fun
    and the boring jobs bearable" - me

      I will be wrapping the string in a dummy tag.

      --
      Brother Marvell
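
A minimal sketch of the dummy-tag idea, assuming each snippet is already in memory as a string. The helper name `fragment_is_well_formed` and the wrapper element name `dummy` are made up for illustration; only `XML::Parser->new` and `parse` come from the module itself.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: wraps the fragment in a single dummy root element and
# returns 1 if the result parses, 0 if the parser died.
sub fragment_is_well_formed {
    my ($parser, $fragment) = @_;
    # The trailing "1" makes the block eval return true only on success.
    my $ok = eval { $parser->parse("<dummy>$fragment</dummy>"); 1 };
    return $ok ? 1 : 0;
}

my $parser = XML::Parser->new;    # no handlers: well-formedness check only
for my $fragment ('<p>one</p><p>two</p>', '<p>one<br></p>') {
    my $verdict = fragment_is_well_formed($parser, $fragment) ? 'good' : 'bad';
    print "$fragment is $verdict\n";
}
```

The first fragment has two top-level elements, so it only passes because of the wrapper; the second fails on the unclosed `<br>`. One caveat of this approach: a fragment that itself closes and reopens the dummy tag would slip through.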

Re: well formed xml
by mirod (Canon) on Feb 28, 2001 at 20:41 UTC

    Contrary to popular belief, wrapping a call to XML::Parser in an eval, provided it's a block eval, is completely kosher and does not inflict a huge (or actually any) performance penalty. In any case the actual parsing takes much longer than eval-ing a short string even if you choose to use the string version of eval. You can have a look at the review for a way to do this.

    Now if you are trying to check tons of files you might want to use James Clark's sp. James Clark is the guy who wrote the expat library on which XML::Parser is based. He also wrote part of XML-the-spec, groff and way too many things for me to list here... The interesting part is that sp is an SGML parser, which means that you will have to set a couple of environment variables to get it to parse XML (quite easy to do) but also that it will not stop at the first error in the file and will try its best to find other errors. It will also probably be faster than using XML::Parser.

Re: well formed xml
by ZZamboni (Curate) on Feb 28, 2001 at 20:28 UTC
    You could try XML::Simple. You don't have to define any handlers, just give it the text, and it will parse it. It will also die in case of malformed input, but in this case I think it's a legitimate application of eval. See the "Error handling" section of the XML::Simple docs.

    --ZZamboni
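
A rough sketch of the XML::Simple route, assuming the XML is passed in as a string. The helper name `simple_is_well_formed` is invented for this example; `XMLin` is the module's documented entry point. Note that XML::Simple builds a Perl data structure from the document, so it does more work than a handler-less XML::Parser when all you want is a yes/no answer.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# Hypothetical helper: returns 1 if XMLin could parse the string, 0 if it died.
sub simple_is_well_formed {
    my ($xml) = @_;
    # XMLin dies on malformed input, so trap that with a block eval.
    my $ok = eval { XMLin($xml); 1 };
    return $ok ? 1 : 0;
}

for my $xml ('<doc><item>1</item></doc>', '<doc><item>1</doc>') {
    my $verdict = simple_is_well_formed($xml) ? 'good' : 'bad';
    print "$xml is $verdict\n";
}
```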

Re: well formed xml
by sierrathedog04 (Hermit) on Feb 28, 2001 at 21:47 UTC
    A fast Perl way to test if XML is well-formed (as opposed to validated) is to use Cooper and Wall's venerable XML::Parser::Expat.

    Its documentation states:

    setHandlers(TYPE, HANDLER [, TYPE, HANDLER ...])

    This method registers handlers for the various events. If no handlers are registered, then a call to parsestring or parsefile will only determine if the corresponding XML document is well formed (by returning without error.) This may be called from within a handler, after the parse has started.

    Reviews of expat often state that one of its advantages is that it is fast. (For the record, expat is in C rather than in Perl.)

    It is hard to imagine Larry Wall, who wrote version 1.0 of XML::Parser::Expat in order to provide "lowlevel access to James Clark's expat XML parser," bloating his code. Borrowing Davorg's suggested solution, the test would then be:

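        my $file   = 'info.xml';
        my $parser = XML::Parser::Expat->new;
        $parser->setHandlers();    # no handlers: only checks well-formedness
        open(my $fh, '<', $file) or die "Couldn't open $file: $!";
        eval { $parser->parse($fh) };
        if ($@) {
            print "$file is bad\n";    # report the file name, not an unset $_
        }
        else {
            print "$file is good\n";
        }
        close($fh);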

      XML::Parser::Expat is just the lower level interface to Expat used by XML::Parser. It is no more venerable than XML::Parser.

      XML::Parser is an object factory. Every time an XML::Parser object calls its parse or parsefile method, it calls XML::Parser::Expat to create a new parser object. So it does just what you do. And if you don't need any handler there is no need to call setHandlers, there will be no handler set by default.

      As XML::Parser is kinda the official interface to XML::Parser::Expat, and although your code might be marginally faster, I would prefer a slightly improved version of davorg's code, where the creation of the XML object is pulled out of the loop:

          my $p = XML::Parser->new;    # needs to be done only once
          foreach (@list_of_20_000_files) {
              eval { $p->parsefile($_) };    # creates a new XML::Parser::Expat object
              if ($@) {
                  print "$_ is bad\n";
              }
              else {
                  print "$_ is good\n";
              }
          }