in reply to Re^2: The mostly used xml parser
in thread The mostly used xml parser

A month ago I wouldn't have mentioned XML::Twig: XML/HTML::TreeBuilder was my hammer. Having seen XML::Twig mentioned so frequently, I've started using it too, and it does filtering (my first application for it) very nicely!

What I'd like would be an HTML::Twig. Might not be as "easy" to write though :).


Perl is Huffman encoded by design.

Re^4: The mostly used xml parser
by mirod (Canon) on Oct 05, 2005 at 20:50 UTC

    Processing HTML is not too difficult, as long as tidy is around:

    perl -MXML::Twig -e'open( my $fh, "tidy -asxml -quiet pm.html 2>/dev/null| ") or die $!; XML::Twig->parse( $fh)'

    Keeping the output similar to the input is of course much harder, as in this case XML::Twig does not see the original file.
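
    For anything longer than a one-liner, the same pipeline could be sketched roughly as below. The helper name link_targets and the link-listing handler are only illustrative assumptions, not part of XML::Twig; the tidy options are the ones from the one-liner above.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper (not part of XML::Twig): returns the href of every <a>
# element found in well-formed XHTML. $source may be a string or an open
# filehandle, since XML::Twig's parse method accepts either.
sub link_targets {
    my ($source) = @_;
    my @hrefs;
    my $twig = XML::Twig->new(
        twig_handlers => {
            # Called once for each complete <a> element.
            a => sub {
                my ( $t, $elt ) = @_;
                my $href = $elt->att('href');
                push @hrefs, $href if defined $href;
            },
        },
    );
    $twig->parse($source);
    return @hrefs;
}

# Run against a real file only when one is named on the command line.
if ( my $file = shift @ARGV ) {
    # List-form pipe open: no shell is involved, so the file name needs no
    # quoting. Unlike the one-liner, tidy's diagnostics are not discarded.
    open( my $fh, '-|', 'tidy', '-asxml', '-quiet', $file )
        or die "cannot run tidy: $!";
    print "$_\n" for link_targets($fh);
    # tidy may exit non-zero when it only issued warnings, so the result of
    # close is deliberately not treated as fatal here.
    close($fh);
}
```

    Saved as, say, links.pl (an assumed name), it would be run as "perl links.pl pm.html". The handler still only sees tidy's cleaned-up XHTML, so the caveat about the output differing from the original file applies unchanged.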

    Here I pay for the fact that XML::Twig does not accept a SAX stream as input; otherwise I could use XML::LibXML::SAX and get HTML parsing for free. (SAX was quite new when I started writing XML::Twig, and by now the module is very tightly coupled to XML::Parser.)

      Excellent point: using tidy as a preprocessor goes a long way toward bridging the gap. It is also interesting to note that there is a tidy.pm available, although the ActiveState ppm install of it doesn't manage to hook the documentation into ActiveState's Perl documentation :(.


        It must be from HTML::Tidy, which is an HTML checker: it doesn't return the XHTML generated by tidy, just the error messages.

        I could provide a shortcut for using tidy (or xmllint, which can do the same thing) though, provided it is either in the path or you give the path to the executable. I'll add this to the next version, thanks for the idea.
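
        Until such a shortcut exists, something similar can be approximated in user code. The sketch below is only a guess at the shape it might take: converter_command and parse_html_file are hypothetical names, not XML::Twig methods. It assumes xmllint's --html and --xmlout options for the HTML-to-XML conversion, and a Perl and platform where list-form pipe opens are supported.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper, not part of XML::Twig: build the command line for an
# HTML-to-XML converter. $exe is optional; without it the tool is looked up
# in the PATH under its own name.
sub converter_command {
    my ( $tool, $file, $exe ) = @_;
    my %options = (
        tidy    => [ '-asxml', '-quiet' ],
        xmllint => [ '--html', '--xmlout' ],
    );
    die "unknown converter '$tool'\n" unless $options{$tool};
    return ( $exe // $tool, @{ $options{$tool} }, $file );
}

# Hypothetical helper: run the converter and feed its output to an existing
# twig, so that any handlers set on the twig are applied as usual.
sub parse_html_file {
    my ( $twig, $tool, $file, $exe ) = @_;
    open( my $fh, '-|', converter_command( $tool, $file, $exe ) )
        or die "cannot run $tool: $!";
    $twig->parse($fh);
    close($fh);    # the converter may exit non-zero on mere warnings
    return $twig;
}
```

        A call would then look like parse_html_file( XML::Twig->new, 'xmllint', 'pm.html' ), with the path to the executable passed as a fourth argument when the tool is not in the path.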